Collaborative Statistics: Summary of Modifications by R. Bloom



This module summarizes the modifications made by Roberta Bloom to the modules included in
the custom textbook collections Collaborative Statistics by R. Bloom http://cnx.org/content/coll0617/ 2
and Homework Book for Collaborative Statistics by R. Bloom http://cnx.org/content/coll0619/ 3 .
These custom collections are based on the textbook collection Collaborative Statistics,
by B. Illowsky and S. Dean, Connexions Web site, http://cnx.Org/content/coll0522/l.29/,
Dec 5, 2008, but have been modified, as detailed in this module. If future modifications are made to the
custom collection, this module will be updated to contain current information.

IMPORTANT NOTE TO STUDENTS OWNING A PRINT COPY OF THIS TEXTBOOK: 

This custom version of the Collaborative Statistics textbook by Susan Dean and Barbara Illowsky has been 
modified by Roberta Bloom. The sections that are different in this book are listed below, by title, along with 
a description of the changes. Section numbers and page numbers may be different also, but the section 
titles should correspond to the sections in the original Dean/Illowsky textbook. 

If you are using a print copy of this textbook and your class instructor is NOT Ms. Bloom: 

• you need to be aware of the textbook changes listed below 

• you may need to go to Collaborative Statistics collection by S. Dean and B. Illowsky ( 
http://cnx.org/content/coll0522/latest/ ) 4 to view or print out the versions of the listed sections 
as included in the Dean/Illowsky collection for this textbook so that you have the same version as the 
rest of your class. 

The custom collection for Ms. Bloom's class has divided the textbook into two collections: 

• Homework Collection Collaborative Statistics Homework Collection (modified R. Bloom) ( 
http://cnx.org/content/coll0619/latest/ ) 5 which contains the formula summary page, the
homework problems and the review problems.

• Textbook Collection Collaborative Statistics (custom collection modified R. Bloom) ( 
http://cnx.org/content/coll0617/latest/ ) 6 which contains the text and the chapter practices, 
but not the homework or review problems. 

• YOU NEED TO USE BOTH THE HOMEWORK COLLECTION AND THE TEXTBOOK
COLLECTION FOR MS. BLOOM'S CLASS.

List of Modifications: 



1 This content is available online at <http://cnx.Org/content/ml8941/l.4/>.

2 Collaborative Statistics: Custom Version modified by R. Bloom <http://cnx.org/content/coll0617/latest/> 

3 Collaborative Statistics Homework Book: Custom Version modified by R. Bloom <http://cnx.org/content/coll0619/latest/> 

4 Collaborative Statistics <http://cnx.org/content/coll0522/latest/> 

5 Collaborative Statistics Homework Book: Custom Version modified by R. Bloom <http://cnx.org/content/coll0619/latest/> 

6 Collaborative Statistics: Custom Version modified by R. Bloom <http://cnx.org/content/coll0617/latest/> 



Labs and Projects Removed 

The labs and projects have been removed from this modified version of Collaborative Statistics. Ms. Bloom 
posts the labs (and projects, if any) for her class on her class website. If you are using this book with another 
instructor you may need to access and print out the labs or projects online from the original Dean/Illowsky 
Collaborative Statistics textbook collection: http://cnx.org/content/coll0522/latest 

Chapter 1 Data and Sampling 

• Homework: two new homework problems have been added 

Chapter 2 Descriptive Statistics 

• Measuring the Spread of the Data: some revisions in wording and use of symbols; terminology
for the z-score has been introduced; formulas have been added; brief summaries of Chebyshev's Rule
and the Empirical Rule have been added.

• Practice 3: Interpreting Percentiles: new section added that was not included in the original textbook 

• Homework: Some new homework questions have been added 



Chapter 3: Probability Topics 

• Terminology: wording revisions; discussion of the Law of Large Numbers added 

• Independent and Mutually Exclusive Events: wording revisions; an additional worked example has
been added to illustrate determining that two events are not independent

• Contingency Tables: one example was removed from this section 

• Practice 1: the data has been presented in tabular form

• Homework: new problems #33 through #41 have been added 

Chapter 4: Discrete Distributions 

• Homework: new problems #38 through #43 have been added 

Chapter 5: Continuous Probability Distributions 

• Introduction to Continuous Random Variables: some material has been removed 

• Properties of Continuous Random Variables: the concept of probability as area (including graphs
illustrating this concept), probability density functions, and cumulative distribution functions are
explained

• Uniform Distribution: Example 4 is new, illustrating the uniform distribution when the minimum 
value is not 0. This replaces the example for conditional probability that had been in the original 
module. 

• Practice 1: Uniform Distribution: Problems pertaining to conditional probability have been removed 
from this section 

Chapter 6: The Normal Distribution 

• No changes have been made to this chapter 

Chapter 7: Central Limit Theorem 

• The section for CLT for Sums has been retitled as OPTIONAL 

• Using the Central Limit Theorem: Examples illustrating use of the CLT for sums have been removed
from this section.

• Practice: problems pertaining to the CLT for sums were removed from the practice 



• Homework: Several homework exercises pertaining to the CLT for sums have been removed. Several 
homework exercises have been changed to reflect only the CLT for means: parts pertaining to the CLT 
for sums were replaced with new parts or removed. Specific changes made: Exercise #1 parts g and i,
exercise #15 parts d, e, and i, and exercise #6 have been removed. Exercises #7, #11, #13, #22, and #23
have been modified. Exercise #24 is new.

• Review: the context of the last problem has been changed 

Chapter 8: Confidence Interval Estimates 

• Confidence Interval for an Unknown Population Mean, Population Standard Deviation Known:
Normal Distribution: worked examples have been modified to show the step-by-step solution using
the error bound formulas to find the confidence intervals; the revised module may also contain some
revisions in wording.

• Confidence Interval for an Unknown Population Mean, Population Standard Deviation Known:
Normal Distribution: An additional example has been added to the module illustrating how to find the
mean and the error bound when only the confidence interval is given.

• Confidence Interval for an Unknown Population Mean, Population Standard Deviation Unknown:
Student t Distribution: worked examples have been modified to show the step-by-step solution using
the error bound formulas to find the confidence intervals; the revised module may also contain some
revisions in wording.

• Confidence Interval for an Unknown Population Proportion: worked examples have been modified to
show the step-by-step solution using the error bound formulas to find the confidence intervals; the
revised module may also contain some revisions in wording.

• In all 3 sections listed above, the emphasis has been changed to using the error bound formulas to 
calculate the confidence interval rather than reliance on the calculators' interval functions to find the 
confidence interval. 

Chapter 9: Hypothesis Test for a Single Mean or Single Proportion 

• Homework: Some homework exercises have been omitted in this revision; the numbering remains 
unchanged for the remaining exercises. The omitted exercises are indicated in the section. 

Chapter 10: Hypothesis Testing: Two Means, Paired Data, Two Proportions 

• At this time, no changes have been made in this chapter. 

Chapter 11: The Chi-Square Distribution

• No changes have been made in this chapter. 

Chapter 12: Linear Regression and Correlation 

• Sections 12.5 (The Regression Equation), 12.6 (Correlation Coefficient and Coefficient of
Determination), 12.7 (Testing the Significance of the Correlation Coefficient), 12.8 (Prediction), and
12.9 (Outliers) have been modified

• Section 12.5 now contains calculator instructions for the LinRegTTest for the TI-83,83+,84+ calculators 

• Section 12.6 includes the coefficient of determination that was not included in the original section and 
includes some material that was originally in section 12.7 

• Section 12.7 now includes both the p-value approach and critical value approach to testing the
significance of the correlation coefficient. It also contains additional information about the assumptions
underlying the test of significance. Some material originally contained in section 12.7 has been moved
forward to section 12.6

• Section 12.9 includes a graphical method of identifying outliers, in addition to the numerical method 
included in the original version of this section 



• In the Homework Section, 2 new problems have been added. 

Chapter 13: F Distribution and ANOVA

• No changes have been made in this chapter. 

Hypothesis Test Solution Sheets 

• Hypothesis test solution sheets for chapters 9, 10, 11, 13 will be modified and will be made available, 
appropriately formatted, via links on Ms. Bloom's class website. 

• The original solution sheets will be removed from the online collection. 

Practice Final Exams 

• Practice Final Exams have been removed from this collection 

• Instead, Practice Final Exams will be made available via links on Ms. Bloom's class website 



Preface by S. Dean and B. Illowsky 



Welcome to Collaborative Statistics, presented by Connexions. The initial section below introduces you to 
Connexions. If you are familiar with Connexions, please skip to About "Collaborative Statistics."

About Connexions 

Connexions Modular Content 

Connexions (cnx.org 8 ) is an online, open access educational resource dedicated to providing high quality 
learning materials free online, free in printable PDF format, and at low cost in bound volumes through 
print-on-demand publishing. The Collaborative Statistics textbook is one of many collections available 
to Connexions users. Each collection is composed of a number of re-usable learning modules written in 
the Connexions XML markup language. Each module may also be re-used (or 're-purposed') as part of 
other collections and may be used outside of Connexions. Including Collaborative Statistics, Connexions 
currently offers over 6500 modules and more than 350 collections. 

The modules of Collaborative Statistics are derived from the original paper version of the textbook under 
the same title, Collaborative Statistics. Each module represents a self-contained concept from the original 
work. Together, the modules comprise the original textbook. 

Re-use and Customization 

The Creative Commons (CC) Attribution license 9 applies to all Connexions modules. Under this license, 
any module in Connexions may be used or modified for any purpose as long as proper attribution to the 
original author(s) is maintained. Connexions' authoring tools make re-use (or re-purposing) easy.
Therefore, instructors anywhere are permitted to create customized versions of the Collaborative
Statistics textbook by editing modules, deleting unneeded modules, and adding their own supplementary modules.
Connexions' authoring tools keep track of these changes and maintain the CC license's required attribution 
to the original authors. This process creates a new collection that can be viewed online, downloaded as a 
single PDF file, or ordered in any quantity by instructors and students as a low-cost printed textbook. To 
start building custom collections, please visit the help page, "Create a Collection with Existing Modules" 10 .
For a guide to authoring modules, please look at the help page, "Create a Module in Minutes" 11 .

Read the book online, print the PDF, or buy a copy of the book. 

To browse the Collaborative Statistics textbook online, visit the collection home page at 
cnx.org/content/coll0522/latest 12 . You will then have three options. 



7 This content is available online at <http://cnx.Org/content/ml6026/l.16/>. 

8 http://cnx.org/ 

9 http://creativecommons.org/licenses/by/2.0/ 
10 http://cnx.org/help/CreateCollection 
11 http://cnx.org/help/ModuleInMinutes
12 Collaborative Statistics <http://cnx.org/content/coll0522/latest/> 



1. You may obtain a PDF of the entire textbook to print or view offline by clicking on the "Download 
PDF" link in the "Content Actions" box. 

2. You may order a bound copy of the collection by clicking on the "Order Printed Copy" button. 

3. You may view the collection modules online by clicking on the "Start »" link, which takes you to the
first module in the collection. You can then navigate through the subsequent modules by using their
"Next »" and "Previous »" links to move forward and backward in the collection. You can jump to
any module in the collection by clicking on that module's title in the "Collection Contents" box on the 
left side of the window. If these contents are hidden, make them visible by clicking on "[show table 
of contents]". 

Accessibility and Section 508 Compliance 

• For information on general Connexions accessibility features, please visit 
http://cnx.org/content/ml7212/latest/ 13 . 

• For information on accessibility features specific to the Collaborative Statistics textbook, please visit 
http://cnx.org/content/ml7211/latest/ 14 . 

Version Change History and Errata 

• For a list of modifications, updates, and corrections, please visit 
http://cnx.org/content/ml7360/latest/ 15 . 

Adoption and Usage 

• The Collaborative Statistics collection has been adopted and customized by a number of
professors and educators for use in their classes. For a list of known versions and adopters, please visit
http://cnx.org/content/ml8261/latest/ 16 .

About "Collaborative Statistics" 

Collaborative Statistics was written by Barbara Illowsky and Susan Dean, faculty members at De Anza
College in Cupertino, California. The textbook was developed over several years and has been used in regular
and honors-level classroom settings and in distance learning classes. Courses using this textbook have been 
articulated by the University of California for transfer of credit. The textbook contains full materials for 
course offerings, including expository text, examples, labs, homework, and projects. A Teacher's Guide is 
currently available in print form and on the Connexions site at http://cnx.org/content/coll0547/latest/ 17 ,
and supplemental course materials including additional problem sets and video lectures are available at
http://cnx.org/content/coll0586/latest/ 18 . The on-line text for each of these collections will
meet the Section 508 standards for accessibility.

An on-line course based on the textbook was also developed by Illowsky and Dean. It has won an award 
as the best on-line California community college course. The on-line course will be available at a later date 
as a collection in Connexions, and each lesson in the on-line course will be linked to the on-line textbook 
chapter. The on-line course will include, in addition to expository text and examples, videos of course 
lectures in captioned and non-captioned format. 

The original preface to the book, as written by professors Illowsky and Dean, now follows:



13 "Accessibility Features of Connexions" <http://cnx.org/content/ml7212/latest/> 

14 "Collaborative Statistics: Accessibility" <http://cnx.org/content/ml7211/latest/> 

15 "Collaborative Statistics: Change History" <http://cnx.org/content/ml7360/latest/> 

16 "Collaborative Statistics: Adoption and Usage" <http://cnx.org/content/ml8261/latest/> 

17 Collaborative Statistics Teacher's Guide <http://cnx.org/content/coll0547/latest/> 

18 Collaborative Statistics: Supplemental Course Materials <http://cnx.org/content/coll0586/latest/> 



This book is intended for introductory statistics courses being taken by students at two- and four-year 
colleges who are majoring in fields other than math or engineering. Intermediate algebra is the only
prerequisite. The book focuses on applications of statistical knowledge rather than the theory behind it. The
text is named Collaborative Statistics because students learn best by doing. In fact, they learn best by 
working in small groups. The old saying "two heads are better than one" truly applies here. 

Our emphasis in this text is on four main concepts: 



• thinking statistically

• incorporating technology

• working collaboratively

• writing thoughtfully



These concepts are integral to our course. Students learn the best by actively participating, not by just 
watching and listening. Teaching should be highly interactive. Students need to be thoroughly engaged 
in the learning process in order to make sense of statistical concepts. Collaborative Statistics provides 
techniques for students to write across the curriculum, to collaborate with their peers, to think statistically, 
and to incorporate technology. 

This book takes students step by step. The text is interactive. Therefore, students can immediately apply 
what they read. Once students have completed the process of problem solving, they can tackle interesting 
and challenging problems relevant to today's world. The problems require the students to apply their 
newly found skills. In addition, technology (TI-83 graphing calculators are highlighted) is incorporated 
throughout the text and the problems, as well as in the special group activities and projects. The book also 
contains labs that use real data and practices that lead students step by step through the problem solving 
process. 

At De Anza, along with hundreds of other colleges across the country, the college audience involves a 
large number of ESL students as well as students from many disciplines. The ESL students, as well as 
the non-ESL students, have been especially appreciative of this text. They find it extremely readable and 
understandable. Collaborative Statistics has been used in classes that range from 20 to 120 students, and in 
regular, honor, and distance learning classes. 

Susan Dean 

Barbara Illowsky 
Additional Resources Currently Available

• Glossary

• View or Download This Textbook Online

• Collaborative Statistics Teacher's Guide

• Supplemental Materials

• Video Lectures

• Version History

• Textbook Adoption and Usage

• Additional Technologies and Notes

• Accessibility and Section 508 Compliance

The following section describes some additional resources for learners and educators. These modules and 
collections are all available on the Connexions website (http://cnx.org/ 20 ) and can be viewed online, 
downloaded, printed, or ordered as appropriate. 

Glossary 

This module contains the entire glossary for the Collaborative Statistics textbook collection (coll0522) since 
its initial release on 15 July 2008. The glossary is located at http://cnx.org/content/ml6129/latest/ 21 . 

View or Download This Textbook Online 

The complete contents of this book are available at no cost on the Connexions website at 
http://cnx.org/content/coll0522/latest/ 22 . Anybody can view this content free of charge either as an 
online e-book or a downloadable PDF file. A low-cost printed version of this textbook is also available 
here 23 . 

Collaborative Statistics Teacher's Guide 

A complementary Teacher's Guide for Collaborative Statistics is available through Connexions at
http://cnx.org/content/coll0547/latest/ 24 . The Teacher's Guide includes suggestions for presenting
concepts found throughout the book as well as recommended homework assignments. A low-cost printed
version of this textbook is also available here 25 .

Supplemental Materials 

This companion to Collaborative Statistics provides a number of additional resources for use by students 
and instructors based on the award winning Elementary Statistics Sofia online course 26 , also by textbook 



19 This content is available online at <http://cnx.Org/content/ml8746/l.6/>. 

20 http://cnx.org/ 

21 "Collaborative Statistics: Glossary" <http://cnx.org/content/ml6129/latest/> 

22 Collaborative Statistics <http://cnx.org/content/coll0522/latest/> 

23 http://my.qoop.com/store/7064943342106149/7781159220340 

24 Collaborative Statistics Teacher's Guide <http://cnx.org/content/coll0547/latest/> 

25 http://my.qoop.com/store/7064943342106149/8791310589747 

26 http://sofia.fhda.edu/gallery/statistics/index.html 




authors Barbara Illowsky and Susan Dean. This content is designed to complement the textbook by
providing video tutorials, course management materials, and sample problem sets. The Supplemental Materials
collection can be found at http://cnx.org/content/coll0586/latest/ 27 .

Video Lectures 



• Video Lecture 1: Sampling and Data 28

• Video Lecture 2: Descriptive Statistics 29

• Video Lecture 3: Probability Topics 30

• Video Lecture 4: Discrete Distributions 31

• Video Lecture 5: Continuous Random Variables 32

• Video Lecture 6: The Normal Distribution 33

• Video Lecture 7: The Central Limit Theorem 34

• Video Lecture 8: Confidence Intervals 35

• Video Lecture 9: Hypothesis Testing with a Single Mean 36

• Video Lecture 10: Hypothesis Testing with Two Means 37

• Video Lecture 11: The Chi-Square Distribution 38

• Video Lecture 12: Linear Regression and Correlation 39

Version History 

This module contains a listing of changes, updates, and corrections made to the Collaborative Statistics 
textbook collection (coll0522) since its initial release on 15 July 2008. The Version History is located at 
http://cnx.org/content/ml7360/latest/ 40 . 

Textbook Adoption and Usage 

This module is designed to track the various derivations of the Collaborative Statistics textbook and its 
various companion resources, as well as keep track of educators who have adopted various versions for 
their courses. New adopters are encouraged to provide their contact information and describe how they 
will use this book for their courses. The goal is to provide a list that will allow educators using this book 
to collaborate, share ideas, and make suggestions for future development of this text. The Adoption and 
Usage module is located at http://cnx.org/content/ml8261/latest/ 41 . 

Additional Technologies 

In order to provide the most flexible learning resources possible, we invite collaboration from all instructors 
wishing to create customized versions of this content for use with other technologies. For instance, you may 
be interested in creating a set of instructions similar to this collection's calculator notes. If you would like to 
contribute to this collection, please contact the authors with any ideas or materials you have created.

Accessibility and Section 508 Compliance 



27 Collaborative Statistics: Supplemental Course Materials <http://cnx.org/content/coll0586/latest/>
28 "Elementary Statistics: Video Lecture - Sampling and Data" <http://cnx.org/content/ml7561/latest/>
29 "Elementary Statistics: Video Lecture - Descriptive Statistics" <http://cnx.org/content/ml7562/latest/>
30 "Elementary Statistics: Video Lecture - Probability Topics" <http://cnx.org/content/ml7563/latest/>
31 "Elementary Statistics: Video Lecture - Discrete Distributions" <http://cnx.org/content/ml7565/latest/>
32 "Elementary Statistics: Video Lecture - Continuous Random Variables" <http://cnx.org/content/ml7566/latest/>
33 "Elementary Statistics: Video Lecture - The Normal Distribution" <http://cnx.org/content/ml7567/latest/>
34 "Elementary Statistics: Video Lecture - The Central Limit Theorem" <http://cnx.org/content/ml7568/latest/>
35 "Elementary Statistics: Video Lecture - Confidence Intervals" <http://cnx.org/content/ml7569/latest/>
Student Welcome Letter



Dear Student: 

Have you heard others say, "You're taking statistics? That's the hardest course I ever took!" They say that
because they probably spent the entire course confused and struggling. They were probably lectured to
and never had the chance to experience the subject. You will not have that problem. Let's find out why. 

There is a Chinese Proverb that describes our feelings about the field of statistics: 

I HEAR, AND I FORGET 

I SEE, AND I REMEMBER 

I DO, AND I UNDERSTAND 

Statistics is a "do" field. In order to learn it, you must "do" it. We have structured this book so that you will 
have hands-on experiences. They will enable you to truly understand the concepts instead of merely going 
through the requirements for the course. 

What makes this book different from other texts? First, we have eliminated the drudgery of tedious
calculations. You might be using computers or graphing calculators so that you do not need to struggle with
algebraic manipulations. Second, this course is taught as a collaborative activity. With others in your class, 
you will work toward the common goal of learning this material. 

Here are some hints for success in your class: 



• Work hard and work every night. 

• Form a study group and learn together. 

• Don't get discouraged - you can do it! 

• As you solve problems, ask yourself, "Does this answer make sense?" 

• Many statistics words have the same meaning as in everyday English. 

• Go to your teacher for help as soon as you need it. 

• Don't get behind. 

• Read the newspaper and ask yourself, "Does this article make sense?" 

• Draw pictures - they truly help! 

Good luck and don't give up! 

Sincerely, 

Susan Dean and Barbara Illowsky 

45 This content is available online at <http://cnx.Org/content/ml6305/l.5/>. 
De Anza College

21250 Stevens Creek Blvd. 

Cupertino, California 95014 



Chapter 1 

Sampling and Data 

1.1 Sampling and Data 1 
1.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Recognize and differentiate between key terms. 

• Apply various types of sampling methods to data collection. 

• Create and interpret frequency tables. 



1.1.2 Introduction 

You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics
about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or 
watch a news program on television, you are given sample information. With this information, you may 
make a decision about the correctness of a statement, claim, or "fact." Statistical methods can help you make 
the "best educated guess." 

Since you will undoubtedly be given statistical information at some point in your life, you need to know 
some techniques to analyze the information thoughtfully. Think about buying a house or managing a 
budget. Think about your chosen profession. The fields of economics, business, psychology, education, 
biology, law, computer science, police science, and early childhood development require at least one course 
in statistics. 

Included in this chapter are the basic ideas and words of probability and statistics. You will soon
understand that statistics and probability work together. You will also learn how data are gathered and what
"good" data are. 

1.2 Statistics 2 

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see 
and use data in our everyday lives. 



1 This content is available online at <http://cnx.Org/content/ml6008/l.9/>.
2 This content is available online at <http://cnx.Org/content/ml6020/l.14/>. 
1.2.1 Optional Collaborative Classroom Exercise

In your classroom, try this exercise. Have class members write down the average time (in hours, to the 
nearest half-hour) they sleep per night. Your instructor will record the data. Then create a simple graph 
(called a dot plot) of the data. A dot plot consists of a number line and dots (or points) positioned above 
the number line. For example, consider the following data: 

5; 5.5; 6; 6; 6; 6.5; 6.5; 6.5; 6.5; 7; 7; 8; 8; 9 

The dot plot for this data would be as follows: 

Frequency of Average Time (in Hours) Spent Sleeping per Night 
Figure 1.1
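
A dot plot like the one described can also be sketched in plain text. The short Python example below is only an illustration, not part of the original exercise: the function name `dot_plot` and the half-hour `step` are assumptions chosen to match the "nearest half-hour" instruction, and the data are the fourteen values listed above.

```python
from collections import Counter

# Sleep times (hours per night) from the example data above.
sleep_hours = [5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9]


def dot_plot(data, step=0.5):
    """Return a text dot plot: stacked '*' marks above a number line."""
    counts = Counter(data)
    low, high = min(data), max(data)
    # Every tick on the number line, including values nobody reported.
    n_ticks = int(round((high - low) / step)) + 1
    ticks = [low + i * step for i in range(n_ticks)]
    width = 5  # characters reserved for each tick column
    rows = []
    # Build the plot from the tallest stack down to the number line.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(
            ("*" if counts.get(t, 0) >= level else " ").center(width)
            for t in ticks
        )
        rows.append(row.rstrip())
    rows.append("".join(f"{t:g}".center(width) for t in ticks).rstrip())
    return "\n".join(rows)


print(dot_plot(sleep_hours))
```

Each `*` stands for one class member, so the tallest column marks the most frequently reported sleep time.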



Does your dot plot look the same as or different from the example? Why? If you did the same example in 
an English class with the same number of students, do you think the results would be the same? Why or 
why not? 

Where do your data appear to cluster? How could you interpret the clustering? 

The questions above ask you to analyze and interpret your data. With this example, you have begun your 
study of statistics. 

In this course, you will learn how to organize and summarize data. Organizing and summarizing data is 
called descriptive statistics. Two ways to summarize data are by graphing and by numbers (for example, 
finding an average). After you have studied probability and probability distributions, you will use formal 
methods for drawing conclusions from "good" data. The formal methods are called inferential statistics. 
Statistical inference uses probability to determine how confident we can be that the conclusions are correct. 
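
As a small illustration of summarizing data "by numbers," the sketch below computes three common summaries of the sleep data from the classroom exercise using Python's standard `statistics` module. The variable names are illustrative assumptions; the data values come from the example above.

```python
import statistics

# Sleep times (hours per night) from the classroom exercise example.
sleep_hours = [5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9]

mean_hours = statistics.mean(sleep_hours)      # arithmetic average
median_hours = statistics.median(sleep_hours)  # middle value of the sorted data
mode_hours = statistics.mode(sleep_hours)      # most frequently reported value

print(f"mean   = {mean_hours:.2f} hours")
print(f"median = {median_hours} hours")
print(f"mode   = {mode_hours} hours")
```

These numbers describe only the fourteen values collected; drawing conclusions about a larger group is the job of inferential statistics.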

Effective interpretation of data (inference) is based on good procedures for producing data and thoughtful 
examination of the data. You will encounter what will seem to be too many mathematical formulas for 
interpreting data. The goal of statistics is not to perform numerous calculations using the formulas, but to 
gain an understanding of your data. The calculations can be done using a calculator or a computer. The 
understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more 
confident in the decisions you make in life. 
1.3 Probability 3

Probability is a mathematical tool used to study randomness. It deals with the chance (the likelihood) of 
an event occurring. For example, if you toss a fair coin 4 times, the outcomes may not be 2 heads and 2 
tails. However, if you toss the same coin 4,000 times, the outcomes will be close to half heads and half tails. 
The expected theoretical probability of heads in any one toss is $\frac{1}{2}$ or 0.5. Even though the outcomes of a
few repetitions are uncertain, there is a regular pattern of outcomes when there are many repetitions. After 
reading about the English statistician Karl Pearson who tossed a coin 24,000 times with a result of 12,012 
heads, one of the authors tossed a coin 2,000 times. The results were 996 heads. The fraction $\frac{996}{2000}$ is equal
to 0.498 which is very close to 0.5, the expected probability. 
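
The coin-tossing idea can be explored with a short simulation. The Python sketch below is illustrative only: the helper `fraction_heads` and the seed value are assumptions, and a simulated coin stands in for a real one. It also recomputes the two fractions quoted in the paragraph above.

```python
import random


def fraction_heads(n_tosses, rng):
    """Simulate n_tosses of a fair coin and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The seed is arbitrary; it only makes the run repeatable.
rng = random.Random(2008)
for n in (4, 4000):
    print(f"{n} tosses: fraction of heads = {fraction_heads(n, rng):.4f}")

# Fractions reported in the text.
author_fraction = 996 / 2000       # one of the authors: 996 heads in 2,000 tosses
pearson_fraction = 12012 / 24000   # Karl Pearson: 12,012 heads in 24,000 tosses
print(author_fraction, pearson_fraction)
```

With only 4 tosses the fraction can land far from 0.5, while with 4,000 tosses it typically settles close to it.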

The theory of probability began with the study of games of chance such as poker. Predictions take the form 
of probabilities. To predict the likelihood of an earthquake, of rain, or whether you will get an A in this 
course, we use probabilities. Doctors use probability to determine the chance of a vaccination causing the 
disease the vaccination is supposed to prevent. A stockbroker uses probability to determine the rate of 
return on a client's investments. You might use probability to decide to buy a lottery ticket or not. In your 
study of statistics, you will use the power of mathematics through probability calculations to analyze and 
interpret your data. 

1.4 Key Terms 4 

In statistics, we generally want to study a population. You can think of a population as an entire collection 
of persons, things, or objects under study. To study the larger population, we select a sample. The idea of 
sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to 
gain information about the population. Data are the result of sampling from a population. 

Because it takes a lot of time and money to examine an entire population, sampling is a very practical 
technique. If you wished to compute the overall grade point average at your school, it would make sense 
to select a sample of students who attend the school. The data collected from the sample would be the 
students' grade point averages. In presidential elections, opinion poll samples of 1,000 to 2,000 people are 
taken. The opinion poll is supposed to represent the views of the people in the entire country. Manu- 
facturers of canned carbonated drinks take samples to determine if a 16 ounce can contains 16 ounces of 
carbonated drink. 

From the sample data, we can calculate a statistic. A statistic is a number that is a property of the sample. 
For example, if we consider one math class to be a sample of the population of all math classes, then the 
average number of points earned by students in that one math class at the end of the term is an example of 
a statistic. The statistic is an estimate of a population parameter. A parameter is a number that is a property 
of the population. Since we considered all math classes to be the population, then the average number of 
points earned per student over all the math classes is an example of a parameter. 

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The 
accuracy really depends on how well the sample represents the population. The sample must contain the 
characteristics of the population in order to be a representative sample. We are interested in both the 
sample statistic and the population parameter in inferential statistics. In a later chapter, we will use the 
sample statistic to test the validity of the established population parameter. 

A variable, notated by capital letters like X and Y, is a characteristic of interest for each person or thing in 
a population. Variables may be numerical or categorical. Numerical variables take on values with equal 
units such as weight in pounds and time in hours. Categorical variables place the person or thing into a 
category. If we let X equal the number of points earned by one math student at the end of a term, then X 
is a numerical variable. If we let Y be a person's party affiliation, then examples of Y include Republican, 
Democrat, and Independent. Y is a categorical variable. We could do some math with values of X (calculate 
the average number of points earned, for example), but it makes no sense to do math with values of Y 
(calculating an average party affiliation makes no sense). 

3 This content is available online at <http://cnx.org/content/m16015/1.11/>. 
4 This content is available online at <http://cnx.org/content/m16007/1.16/>. 

Data are the actual values of the variable. They may be numbers or they may be words. Datum is a single 
value. 

Two words that come up often in statistics are mean and proportion. If you were to take three exams in 
your math classes and obtained scores of 86, 75, and 92, you calculate your mean score by adding the three 
exam scores and dividing by three (your mean score would be 84.3 to one decimal place). If, in your math 
class, there are 40 students and 22 are men and 18 are women, then the proportion of men students is 22/40 
and the proportion of women students is 18/40. Mean and proportion are discussed in more detail in later 
chapters. 
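
As a quick check of the arithmetic in this paragraph, the mean and the two proportions can be computed in Python. The variable names are illustrative assumptions; the numbers are the ones given above.

```python
from fractions import Fraction

exam_scores = [86, 75, 92]
mean_score = sum(exam_scores) / len(exam_scores)  # add the scores, divide by how many
print(round(mean_score, 1))  # 84.3

men, women = 22, 18
class_size = men + women                   # 40 students
prop_men = Fraction(men, class_size)       # 22/40, stored in lowest terms as 11/20
prop_women = Fraction(women, class_size)   # 18/40, stored in lowest terms as 9/20
print(float(prop_men), float(prop_women))  # 0.55 0.45
```

The two proportions add to 1 because every student in the class is counted in exactly one of the two groups.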

NOTE: The words "mean" and "average" are often used interchangeably. The substitution of one 
word for the other is common practice. The technical term is "arithmetic mean" and "average" is 
technically a center location. However, in practice among non-statisticians, "average" is commonly 
accepted for "arithmetic mean." 

Example 1.1 

Define the key terms from the following study: We want to know the average amount of money 
first year college students spend at ABC College on school supplies that do not include books. We 
randomly survey 100 first year students at the college. Three of those students spent $150, $200, 
and $225, respectively. 

Solution 

The population is all first year students attending ABC College this term. 

The sample could be all students enrolled in one section of a beginning statistics course at ABC 
College (although this sample may not represent the entire population). 

The parameter is the average amount of money spent (excluding books) by first year college stu- 
dents at ABC College this term. 

The statistic is the average amount of money spent (excluding books) by first year college students 
in the sample. 

The variable could be the amount of money spent (excluding books) by one first year student. 
Let X = the amount of money spent (excluding books) by one first year student attending ABC 
College. 

The data are the dollar amounts spent by the first year students. Examples of the data are $150, 
$200, and $225. 



1.4.1 Optional Collaborative Classroom Exercise 

Do the following exercise collaboratively with up to four people per group. Find a population, a sample, 
the parameter, the statistic, a variable, and data for the following study: You want to determine the average 
number of glasses of milk college students drink per day. Suppose yesterday, in your English class, you 
asked five students how many glasses of milk they drank the day before. The answers were 1, 0, 1, 3, and 4 
glasses of milk. 

1.5 Data 5 

Data may come from a population or from a sample. Small letters like x or y generally are used to represent 
data values. Most data can be put into the following categories: 

• Qualitative 

• Quantitative 

Qualitative data are the result of categorizing or describing attributes of a population. Hair color, blood 
type, ethnic group, the car a person drives, and the street a person lives on are examples of qualitative data. 
Qualitative data are generally described by words or letters. For instance, hair color might be black, dark 
brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+. Researchers often prefer to 
use quantitative data over qualitative data because they lend themselves more easily to mathematical analysis. For 
example, it does not make sense to find an average hair color or blood type. 

Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes 
of a population. Amount of money, pulse rate, weight, number of people living in your town, and the 
number of students who take statistics are examples of quantitative data. Quantitative data may be either 
discrete or continuous. 

All data that are the result of counting are called quantitative discrete data. These data take on only certain 
numerical values. If you count the number of phone calls you receive for each day of the week, you might 
get 0, 1, 2, 3, etc. 

All data that are the result of measuring are quantitative continuous data, assuming that we can measure 
accurately. Measuring angles in radians might result in the numbers π/6, π/3, π/2, π, 3π/4, etc. If you and your 
friends carry backpacks with books in them to school, the numbers of books in the backpacks are discrete 
data and the weights of the backpacks are continuous data. 

Example 1.2: Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You sample five students. 
Two students carry 3 books, one student carries 4 books, one student carries 2 books, and one 
student carries 1 book. The numbers of books (3, 4, 2, and 1) are the quantitative discrete data. 

Example 1.3: Data Sample of Quantitative Continuous Data 

The data are the weights of the backpacks with the books in them. You sample the same five students. 
The weights (in pounds) of their backpacks are 6.2, 7, 6.8, 9.1, 4.3. Notice that backpacks carrying 
three books can have different weights. Weights are quantitative continuous data because weights 
are measured. 

Example 1.4: Data Sample of Qualitative Data 

The data are the colors of backpacks. Again, you sample the same five students. One student has 
a red backpack, two students have black backpacks, one student has a green backpack, and one 
student has a gray backpack. The colors red, black, black, green, and gray are qualitative data. 

NOTE: You may collect data as numbers and report it categorically. For example, the quiz scores 
for each student are recorded throughout the term. At the end of the term, the quiz scores are 
reported as A, B, C, D, or F. 



5 This content is available online at <http://cnx.org/content/m16005/1.15/>. 

Example 1.5 

Work collaboratively to determine the correct data type (quantitative or qualitative). Indicate 
whether quantitative data are continuous or discrete. Hint: Data that are discrete often start with 
the words "the number of." 

1. The number of pairs of shoes you own. 

2. The type of car you drive. 

3. Where you go on vacation. 

4. The distance it is from your home to the nearest grocery store. 

5. The number of classes you take per school year. 

6. The tuition for your classes. 

7. The type of calculator you use. 

8. Movie ratings. 

9. Political party preferences. 

10. Weight of sumo wrestlers. 

11. Amount of money (in dollars) won playing poker. 

12. Number of correct answers on a quiz. 

13. People's attitudes toward the government. 

14. IQ scores. (This may cause some discussion.) 



1.6 Sampling 6 

Gathering information about an entire population often costs too much or is virtually impossible. Instead, 
we use a sample of the population. A sample should have the same characteristics as the population it 
is representing. Most statisticians use various methods of random sampling in an attempt to achieve this 
goal. This section will describe a few of the most common methods. 

There are several different methods of random sampling. In each form of random sampling, each member 
of a population initially has an equal chance of being selected for the sample. Each method has pros and 
cons. The easiest method to describe is called a simple random sample. Any group of n individuals is 
equally likely to be chosen as any other group of n individuals if the simple random sampling technique is 
used. In other words, each sample of the same size has an equal chance of being selected. For example, sup- 
pose Lisa wants to form a four-person study group (herself and three other people) from her pre-calculus 
class, which has 31 members not including Lisa. To choose a simple random sample of size 3 from the other 
members of her class, Lisa could put all 31 names in a hat, shake the hat, close her eyes, and pick out 3 
names. A more technological way is for Lisa to first list the last names of the members of her class together 
with a two-digit number as shown below. 



6 This content is available online at <http://cnx.org/content/m16014/1.17/>. 
Class Roster 

ID    Name 
00    Anselmo 
01    Bautista 
02    Bayani 
03    Cheng 
04    Cuarismo 
05    Cunningham 
06    Fontecha 
07    Hong 
08    Hoobler 
09    Jiao 
10    Khan 
11    King 
12    Legeny 
13    Lundquist 
14    Macierz 
15    Motogawa 
16    Okimoto 
17    Patel 
18    Price 
19    Quizon 
20    Reyes 
21    Roquero 
22    Roth 
23    Rowell 
24    Salangsang 
25    Slade 
26    Stracher 
27    Tallai 
28    Tran 
29    Wai 
30    Wood 

Table 1.1 



Lisa can either use a table of random numbers (found in many statistics books as well as mathematical 
handbooks) or a calculator or computer to generate random numbers. For this example, suppose Lisa 
chooses to generate random numbers from a calculator. The numbers generated are: 

.94360; .99832; .14669; .51470; .40581; .73381; .04399 

Lisa reads two-digit groups until she has chosen three class members (that is, she reads .94360 as the groups 
94, 43, 36, 60). Each random number may only contribute one class member. If she needed to, Lisa could 
have generated more random numbers. 

The random numbers .94360 and .99832 do not contain appropriate two-digit numbers. However, the third 
random number, .14669, contains 14 (the fourth random number also contains 14), the fifth random number 
contains 05, and the seventh random number contains 04. The two-digit number 14 corresponds to Macierz, 
05 corresponds to Cunningham, and 04 corresponds to Cuarismo. Besides herself, Lisa's group will consist 
of Macierz, Cunningham, and Cuarismo. 
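
Lisa's procedure can be sketched in Python to confirm the three names. The function name pick_members is hypothetical, and one detail is an assumption: when a two-digit group repeats an ID that was already chosen (as 14 does in the fourth number), the sketch simply skips it and keeps reading that number.

```python
roster = ["Anselmo", "Bautista", "Bayani", "Cheng", "Cuarismo", "Cunningham",
          "Fontecha", "Hong", "Hoobler", "Jiao", "Khan", "King", "Legeny",
          "Lundquist", "Macierz", "Motogawa", "Okimoto", "Patel", "Price",
          "Quizon", "Reyes", "Roquero", "Roth", "Rowell", "Salangsang",
          "Slade", "Stracher", "Tallai", "Tran", "Wai", "Wood"]  # IDs 00 to 30

random_numbers = [".94360", ".99832", ".14669", ".51470",
                  ".40581", ".73381", ".04399"]

def pick_members(random_numbers, roster, sample_size):
    """Read overlapping two-digit groups; each random number adds at most one ID."""
    chosen = []
    for number in random_numbers:
        digits = number.lstrip(".")
        # ".94360" is read as the groups 94, 43, 36, 60.
        groups = [int(digits[i:i + 2]) for i in range(len(digits) - 1)]
        for group in groups:
            if group < len(roster) and group not in chosen:
                chosen.append(group)
                break  # this random number has contributed its one member
        if len(chosen) == sample_size:
            break
    return [roster[i] for i in chosen]

print(pick_members(random_numbers, roster, 3))
# ['Macierz', 'Cunningham', 'Cuarismo']
```

In practice a computer would more likely draw three IDs directly, but the sketch follows the hand procedure so each step can be compared with the text.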

Besides simple random sampling, there are other forms of sampling that involve a chance process for get- 
ting the sample. Other well-known random sampling methods are the stratified sample, the cluster 
sample, and the systematic sample. 

To choose a stratified sample, divide the population into groups called strata and then take a proportionate 
number from each stratum. For example, you could stratify (group) your college population by department 
and then choose a proportionate simple random sample from each stratum (each department) to get a strat- 
ified random sample. To choose a simple random sample from each department, number each member of 
the first department, number each member of the second department and do the same for the remaining de- 
partments. Then use simple random sampling to choose proportionate numbers from the first department 
and do the same for each of the remaining departments. Those numbers picked from the first department, 
picked from the second department and so on represent the members who make up the stratified sample. 
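
The stratified procedure described above can be sketched as follows. The department names, sizes, and the 10% sampling fraction are all invented for illustration, and stratified_sample is a hypothetical helper, not a library function.

```python
import random

def stratified_sample(strata, fraction, rng):
    """Take the same fraction from every stratum using simple random sampling."""
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)     # proportionate number for this stratum
        sample[name] = rng.sample(members, k)  # drawn without replacement
    return sample

# Hypothetical college population, numbered within each department.
strata = {
    "Math": [f"M{i:03d}" for i in range(200)],
    "English": [f"E{i:03d}" for i in range(120)],
    "Art": [f"A{i:03d}" for i in range(80)],
}

sample = stratified_sample(strata, 0.10, random.Random(7))  # seed is arbitrary
print({name: len(chosen) for name, chosen in sample.items()})
# {'Math': 20, 'English': 12, 'Art': 8}
```

Larger departments contribute more members, so the sample keeps the same departmental mix as the population.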

To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of 
the clusters. All the members from these clusters are in the cluster sample. For example, if you randomly 
sample four departments from your college population, the four departments make up the cluster sample. 
For example, divide your college faculty by department. The departments are the clusters. Number each 
department and then choose four different numbers using simple random sampling. All members of the 
four departments with those numbers are the cluster sample. 
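
A minimal sketch of the cluster procedure, assuming a small made-up faculty list; the department names, member labels, and the helper cluster_sample are hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, rng):
    """Randomly choose whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [person for name in chosen for person in clusters[name]]
    return chosen, members

# Hypothetical faculty, grouped by department (the clusters).
clusters = {
    "Art": ["A1", "A2"],
    "Biology": ["B1", "B2", "B3"],
    "Chemistry": ["C1", "C2"],
    "English": ["E1", "E2", "E3", "E4"],
    "History": ["H1"],
    "Math": ["M1", "M2", "M3"],
}

chosen, members = cluster_sample(clusters, 4, random.Random(11))  # arbitrary seed
print(chosen)
print(members)
```

Unlike the stratified sample, only some groups appear here, but each selected group is included in full.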

To choose a systematic sample, randomly select a starting point and take every nth piece of data from a 
listing of the population. For example, suppose you have to do a phone survey. Your phone book contains 
20,000 residence listings. You must choose 400 names for the sample. Number the population 1 - 20,000 
and then use a simple random sample to pick a number that represents the first name of the sample. Then 
choose every 50th name thereafter until you have a total of 400 names (you might have to go back to the 
beginning of your phone list). Systematic sampling is frequently chosen because it is a simple method. 
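
The phone survey example can be sketched in Python. The helper systematic_sample is hypothetical; it assumes the listings are numbered 1 to 20,000 and that counting wraps around to listing 1 after the last listing is passed.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Pick a random start, then take every k-th listing, wrapping around."""
    step = population_size // sample_size    # 20,000 // 400 = 50
    start = rng.randint(1, population_size)  # random starting listing number
    return [(start - 1 + i * step) % population_size + 1
            for i in range(sample_size)]

listing_numbers = systematic_sample(20_000, 400, random.Random(3))  # arbitrary seed
print(len(listing_numbers), len(set(listing_numbers)))  # 400 400
print(listing_numbers[:3])
```

Because 400 steps of 50 cover exactly 20,000 listings, no listing can be chosen twice.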

A type of sampling that is nonrandom is convenience sampling. Convenience sampling involves using 
results that are readily available. For example, a computer software store conducts a marketing study by 
interviewing potential customers who happen to be in the store browsing through the available software. 
The results of convenience sampling may be very good in some cases and highly biased (favoring certain 
outcomes) in others. 

Sampling data should be done very carefully. Collecting data carelessly can have devastating results. Sur- 
veys mailed to households and then returned may be very biased (for example, they may favor a certain 
group). It is better for the person conducting the survey to select the sample respondents. 

True random sampling is done with replacement. That is, once a member is picked that member goes 
back into the population and thus may be chosen more than once. However, for practical reasons, in most 
populations, simple random sampling is done without replacement. Surveys are typically done without 
replacement. That is, a member of the population may be chosen only once. Most samples are taken from 
large populations and the sample tends to be small in comparison to the population. Since this is the case, 
sampling without replacement is approximately the same as sampling with replacement because the chance 
of picking the same individual more than once with replacement is very low. 

For example, in a college population of 10,000 people, suppose you want to randomly pick a sample of 1000 
for a survey. For any particular sample of 1000, if you are sampling with replacement, 

• the chance of picking the first person is 1000 out of 10,000 (0.1000); 

• the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999); 

• the chance of picking the same person again is 1 out of 10,000 (very low). 

If you are sampling without replacement, 

• the chance of picking the first person for any particular sample is 1000 out of 10,000 (0.1000); 

• the chance of picking a different second person is 999 out of 9,999 (0.0999); 

• you do not replace the first person before picking the next person. 

Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to 4 decimal 
places. To 4 decimal places, these numbers are equivalent (0.0999). 

Sampling without replacement instead of sampling with replacement only becomes a mathematics issue 
when the population is small which is not that common. For example, if the population is 25 people, the 
sample is 10 and you are sampling with replacement for any particular sample, 

• the chance of picking the first person is 10 out of 25 and a different second person is 9 out of 25 (you 
replace the first person). 

If you sample without replacement, 

• the chance of picking the first person is 10 out of 25 and then the second person (which is different) is 
9 out of 24 (you do not replace the first person). 

Compare the fractions 9/25 and 9/24. To 4 decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To 4 decimal 
places, these numbers are not equivalent. 
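
The two comparisons above can be reproduced with a short Python sketch. The function name second_pick_chances is an assumption made for this illustration; the population and sample sizes are the ones used in the text.

```python
def second_pick_chances(population, sample):
    """Chance of picking a different second person, with and without replacement."""
    with_replacement = (sample - 1) / population
    without_replacement = (sample - 1) / (population - 1)
    return with_replacement, without_replacement

for population, sample in [(10_000, 1_000), (25, 10)]:
    w, wo = second_pick_chances(population, sample)
    print(f"N={population}, n={sample}: {w:.4f} vs {wo:.4f}")
# N=10000, n=1000: 0.0999 vs 0.0999
# N=25, n=10: 0.3600 vs 0.3750
```

The gap between the two fractions depends on how much removing one person changes the denominator, which is why it only shows up for the small population.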

When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The actual 
process of sampling causes sampling errors. For example, the sample may not be large enough. Factors 
not related to the sampling process cause nonsampling errors. A defective counting device can cause a 
nonsampling error. 

In reality, a sample will never be exactly representative of the population so there will always be 
some sampling error. As a rule, the larger the sample, the smaller the sampling error. 

In statistics, a sampling bias is created when a sample is collected from a population and some 
members of the population are not as likely to be chosen as others (remember, each member of the 
population should have an equally likely chance of being chosen). When a sampling bias happens, there 
can be incorrect conclusions drawn about the population that is being studied. 

Example 1.6 

Determine the type of sampling used (simple random, stratified, systematic, cluster, or conve- 
nience). 

1. A soccer coach selects 6 players from a group of boys aged 8 to 10, 7 players from a group of 
boys aged 11 to 12, and 3 players from a group of boys aged 13 to 14 to form a recreational 
soccer team. 

2. A pollster interviews all human resource personnel in five different high tech companies. 

3. A high school educational researcher interviews 50 high school female teachers and 50 high 
school male teachers. 

4. A medical researcher interviews every third cancer patient from a list of cancer patients at a 
local hospital. 

5. A high school counselor uses a computer to generate 50 random numbers and then picks 
students whose names correspond to the numbers. 

6. A student interviews classmates in his algebra class to determine how many pairs of jeans a 
student owns, on the average. 

Solution 

1. stratified 

2. cluster 

3. stratified 

4. systematic 

5. simple random 

6. convenience 



If we were to examine two samples representing the same population, even if we used random sampling 
methods for the samples, they would not be exactly the same. Just as there is variation in data, there is 
variation in samples. As you become accustomed to sampling, the variability will seem natural. 

Example 1.7 

Suppose ABC College has 10,000 part-time students (the population). We are interested in the 
average amount of money a part-time student spends on books in the fall term. Asking all 10,000 
students is an almost impossible task. 

Suppose we take two different samples. 

First, we use convenience sampling and survey 10 students from a first term organic chemistry 
class. Many of these students are taking first term calculus in addition to the organic chemistry 
class. The amount of money they spend is as follows: 

$128; $87; $173; $116; $130; $204; $147; $189; $93; $153 

The second sample is taken by using a list from the P.E. department of senior citizens who take 
P.E. classes and taking every 5th senior citizen on the list, for a total of 10 senior citizens. They 
spend: 

$50; $40; $36; $15; $50; $100; $40; $53; $22; $22 

Problem 1 

Do you think that either of these samples is representative of (or is characteristic of) the entire 
10,000 part-time student population? 

Solution 

No. The first sample probably consists of science-oriented students. Besides the chemistry course, 
some of them are taking first-term calculus. Books for these classes tend to be expensive. Most 
of these students are, more than likely, paying more than the average part-time student for their 
books. The second sample is a group of senior citizens who are, more than likely, taking courses 
for health and interest. The amount of money they spend on books is probably much less than the 
average part-time student. Both samples are biased. Also, in both cases, not all students have a 
chance to be in either sample. 

Problem 2 

Since these samples are not representative of the entire population, is it wise to use the results to 
describe the entire population? 

Solution 

No. For these samples, each member of the population did not have an equally likely chance of 
being chosen. 

Now, suppose we take a third sample. We choose ten different part-time students from the dis- 
ciplines of chemistry, math, English, psychology, sociology, history, nursing, physical education, 
art, and early childhood development. (We assume that these are the only disciplines in which 
part-time students at ABC College are enrolled and that an equal number of part-time students 
are enrolled in each of the disciplines.) Each student is chosen using simple random sampling. 
Using a calculator, random numbers are generated and a student from a particular discipline is 
selected if he/she has a corresponding number. The students spend: 

$180; $50; $150; $85; $260; $75; $180; $200; $200; $150 

Problem 3 

Is the sample biased? 

Solution 

The sample is unbiased, but a larger sample would be recommended to increase the likelihood 
that the sample will be close to representative of the population. However, for a biased sampling 
technique, even a large sample runs the risk of not being representative of the population. 

Students often ask if it is "good enough" to take a sample, instead of surveying the entire popula- 
tion. If the survey is done well, the answer is yes. 
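
The example does not report the three sample averages, but computing them makes the bias discussion concrete. The sketch below uses only the dollar amounts listed in Example 1.7; the dictionary labels are informal descriptions added for readability.

```python
from statistics import mean

samples = {
    "organic chemistry class (convenience)": [128, 87, 173, 116, 130, 204, 147, 189, 93, 153],
    "senior citizens in P.E. (every 5th)": [50, 40, 36, 15, 50, 100, 40, 53, 22, 22],
    "one student per discipline (random)": [180, 50, 150, 85, 260, 75, 180, 200, 200, 150],
}

for label, amounts in samples.items():
    print(f"{label}: ${mean(amounts):.2f}")
# organic chemistry class (convenience): $142.00
# senior citizens in P.E. (every 5th): $42.80
# one student per discipline (random): $153.00
```

The senior citizen sample average is far below the other two, which is consistent with the solution's point that this group probably spends much less than the average part-time student. With only 10 students per sample, none of these averages should be read as a precise estimate.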



1.6.1 Optional Collaborative Classroom Exercise 

Exercise 1.6.1 

As a class, determine whether or not the following samples are representative. If they are not, 
discuss the reasons. 

1. To find the average GPA of all students in a university, use all honor students at the university 
as the sample. 

2. To find out the most popular cereal among young people under the age of 10, stand outside 
a large supermarket for three hours and speak to every 20th child under age 10 who enters 
the supermarket. 

3. To find the average annual income of all adults in the United States, sample U.S. congress- 
men. Create a cluster sample by considering each state as a stratum (group). By using simple 
random sampling, select states to be part of the cluster. Then survey every U.S. congressman 
in the cluster. 

4. To determine the proportion of people taking public transportation to work, survey 20 peo- 
ple in New York City. Conduct the survey by sitting in Central Park on a bench and inter- 
viewing every person who sits next to you. 

5. To determine the average cost of a two day stay in a hospital in Massachusetts, survey 100 
hospitals across the state using simple random sampling. 

1.7 Variation 7 

1.7.1 Variation in Data 

Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less 
than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following 
amounts (in ounces) of beverage: 

15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5 

Measurements of the amount of beverage in a 16-ounce can may vary because different people make the 
measurements or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers 
regularly run tests to determine if the amount of beverage in a 16-ounce can falls within the desired range. 
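
A brief Python sketch summarizes the variation in the eight measurements. The summary values chosen here (smallest, largest, mean, and a count below 16 ounces) are just one possible way to look at the data, not a test a manufacturer necessarily uses.

```python
from statistics import mean

ounces = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]
label_amount = 16.0

print(f"smallest: {min(ounces)}, largest: {max(ounces)}")
# smallest: 14.8, largest: 16.1
print(f"mean: {mean(ounces):.4f}")
# mean: 15.6375

below = sum(x < label_amount for x in ounces)
print(f"{below} of {len(ounces)} cans hold less than 16 ounces")
# 6 of 8 cans hold less than 16 ounces
```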

Be aware that as you take data, your data may vary somewhat from the data someone else is taking for the 
same purpose. This is completely natural. However, if two or more of you are taking the same data and 
get very different results, it is time for you and the others to reevaluate your data-taking methods and your 
accuracy. 

1.7.2 Variation in Samples 

It was mentioned previously that two or more samples from the same population, taken randomly, and 
having close to the same characteristics of the population are different from each other. Suppose Doreen and 
Jung both decide to study the average amount of time students at their college sleep each night. Doreen and 
Jung each take samples of 500 students. Doreen uses systematic sampling and Jung uses cluster sampling. 
Doreen's sample will be different from Jung's sample. Even if Doreen and Jung used the same sampling 
method, in all likelihood their samples would be different. Neither would be wrong, however. 

Think about what contributes to making Doreen's and Jung's samples different. 

If Doreen and Jung took larger samples (i.e. the number of data values is increased), their sample results 
(the average amount of time a student sleeps) might be closer to the actual population average. But still, 
their samples would be, in all likelihood, different from each other. This variability in samples cannot be 
stressed enough. 

1.7.2.1 Size of a Sample 

The size of a sample (often called the number of observations) is important. The examples you have seen 
in this book so far have been small. Samples of only a few hundred observations, or even smaller, are 
sufficient for many purposes. In polling, samples that are from 1200 to 1500 observations are considered 
large enough and good enough if the survey is random and is well done. You will learn why when you 
study confidence intervals. 

Be aware that many large samples are biased. For example, call-in surveys are invariably biased 
because people choose to respond or not. 



7 This content is available online at <http://cnx.org/content/m16021/1.15/>. 

1.7.2.2 Optional Collaborative Classroom Exercise 

Exercise 1.7.1 

Divide into groups of two, three, or four. Your instructor will give each group one 6-sided die. 
Try this experiment twice. Roll one fair die (6-sided) 20 times. Record the number of ones, twos, 
threes, fours, fives, and sixes you get below ("frequency" is the number of times a particular face 
of the die occurs): 

First Experiment (20 rolls) 

Face on Die    Frequency 
1              ______ 
2              ______ 
3              ______ 
4              ______ 
5              ______ 
6              ______ 

Table 1.2 

Second Experiment (20 rolls) 

Face on Die    Frequency 
1              ______ 
2              ______ 
3              ______ 
4              ______ 
5              ______ 
6              ______ 

Table 1.3 

Did the two experiments have the same results? Probably not. If you did the experiment a third 
time, do you expect the results to be identical to the first or second experiment? (Answer yes or 
no.) Why or why not? 

Which experiment had the correct results? They both did. The job of the statistician is to see 
through the variability and draw appropriate conclusions. 
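
If no die is at hand, the experiment can be imitated in Python. This is a simulation sketch, not a substitute for the classroom activity; the seed is arbitrary and roll_experiment is a hypothetical helper.

```python
import random
from collections import Counter

def roll_experiment(num_rolls, rng):
    """Roll a fair six-sided die num_rolls times; return the frequency of each face."""
    rolls = [rng.randint(1, 6) for _ in range(num_rolls)]
    counts = Counter(rolls)
    return {face: counts.get(face, 0) for face in range(1, 7)}

rng = random.Random(2024)  # arbitrary seed so the run can be repeated
first = roll_experiment(20, rng)
second = roll_experiment(20, rng)
print("First experiment: ", first)
print("Second experiment:", second)
```

Each dictionary plays the role of one filled-in frequency table. Changing the seed gives a different pair of tables, which is the variability the exercise is meant to show.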



1.7.3 Critical Evaluation 

We need to critically evaluate the statistical studies we read about and analyze before accepting the results 
of the study. Common problems to be aware of include 

• Problems with Samples: A sample should be representative of the population. A sample that is not 
representative of the population is biased. Biased samples give results that are inaccurate and not valid. 

• Self-Selected Samples: Responses only by people who choose to respond, such as call-in surveys, are 
often unreliable. 

• Sample Size Issues: Samples that are too small may be unreliable. Larger samples are better if possible. 
In some situations, small samples are unavoidable and can still be used to draw conclusions, even 
though larger samples are better. Examples: Crash testing cars, medical testing for rare conditions. 

• Undue influence: Collecting data or asking questions in a way that influences the response. 

• Non-response or refusal of subject to participate: The collected responses may no longer be representative 
of the population. Often, people with strong positive or negative opinions may answer surveys, 
which can affect the results. 

• Causality: A relationship between two variables does not mean that one causes the other to occur. 
They may both be related (correlated) because of their relationship through a different variable. 

• Self-Funded or Self-interest Studies: A study performed by a person or organization in order to sup- 
port their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not 
automatically assume that the study is good but do not automatically assume the study is bad either. 
Evaluate it on its merits and the work done. 

• Misleading Use of Data: Improperly displayed graphs, incomplete data, lack of context. 

• Confounding: When the effects of multiple factors on a response cannot be separated. Confounding 
makes it difficult or impossible to draw valid conclusions about the effect of each factor. 



1.8 Answers and Rounding Off 8 

A simple way to round off answers is to carry your final answer one more decimal place than was present 
in the original data. Round only the final answer. Do not round any intermediate results, if possible. If it 
becomes necessary to round intermediate results, carry them to at least twice as many decimal places as the 
final answer. For example, the average of the three quiz scores 4, 6, 9 is 6.3, rounded to the nearest tenth, 
because the data are whole numbers. Most answers will be rounded in this manner. 
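
The rounding rule can be illustrated with the quiz score example. In this sketch the variable data_decimals is an assumption that records how many decimal places the original data have (zero for whole numbers).

```python
quiz_scores = [4, 6, 9]
data_decimals = 0  # the data are whole numbers

average = sum(quiz_scores) / len(quiz_scores)     # 6.333..., kept unrounded
final_answer = round(average, data_decimals + 1)  # one more place than the data
print(final_answer)  # 6.3
```

Only final_answer is rounded; average keeps its full precision in case it is needed in a later calculation.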

It is not necessary to reduce most fractions in this course. Especially in Probability Topics (Section 3.1), the 
chapter on probability, it is more helpful to leave an answer as an unreduced fraction. 



1.9 Frequency 9 



Twenty students were asked how many hours they worked per day. Their responses, in hours, are listed 
below: 

5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4; 4; 3; 5; 2; 5; 3 

Below is a frequency table listing the different data values in ascending order and their frequencies. 

8 This content is available online at <http://cnx.org/content/m16006/1.7/>.
9 This content is available online at <http://cnx.org/content/m16012/1.19/>.

Frequency Table of Student Work Hours

DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1

Table 1.4

A frequency is the number of times a given datum occurs in a data set. According to the table above, 
there are three students who work 2 hours, five students who work 3 hours, etc. The total of the frequency 
column, 20, represents the total number of students included in the sample. 

A relative frequency is the fraction or proportion of times an answer occurs. To find the relative fre- 
quencies, divide each frequency by the total number of students in the sample - in this case, 20. Relative 
frequencies can be written as fractions, percents, or decimals. 

Frequency Table of Student Work Hours w/ Relative Frequency

DATA VALUE    FREQUENCY    RELATIVE FREQUENCY
2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05

Table 1.5

The sum of the relative frequency column is 20/20, or 1.

Cumulative relative frequency is the accumulation of the previous relative frequencies. To find the cumu- 
lative relative frequencies, add all the previous relative frequencies to the relative frequency for the current 
row. 



Frequency Table of Student Work Hours w/ Relative and Cumulative Relative Frequency

DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00

Table 1.6

The last entry of the cumulative relative frequency column is one, indicating that one hundred percent of 
the data has been accumulated. 

NOTE: Because of rounding, the relative frequency column may not always sum to one and the last 
entry in the cumulative relative frequency column may not be one. However, they each should be 
close to one. 
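The three columns described above can be reproduced with a short Python sketch. It uses the twenty work-hour responses listed earlier in this section; the variable names are illustrative, and exact fractions are used here (an assumption of this sketch, not something the text requires) so that the cumulative column ends at exactly 1.

```python
from collections import Counter
from fractions import Fraction

hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

freq = Counter(hours)   # frequency: how many times each value occurs
n = len(hours)          # total number of students in the sample

rows = []
cumulative = Fraction(0)
for value in sorted(freq):
    rel = Fraction(freq[value], n)   # relative frequency = frequency / n
    cumulative += rel                # add to all previous relative frequencies
    rows.append((value, freq[value], rel, cumulative))

print("VALUE  FREQ  REL FREQ  CUM REL FREQ")
for value, f, rel, cum in rows:
    print(f"{value:>5} {f:>5} {float(rel):>9.2f} {float(cum):>13.2f}")
```

Printing the fractions as two-place decimals matches the tables above; with data that do not divide evenly, the printed decimals may not sum to exactly one, as the note on rounding explains.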

The following table represents the heights, in inches, of a sample of 100 male semiprofessional soccer play- 
ers. 

Frequency Table of Soccer Player Height

HEIGHTS (INCHES)    FREQUENCY      RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
59.95 - 61.95       5              5/100 = 0.05          0.05
61.95 - 63.95       3              3/100 = 0.03          0.05 + 0.03 = 0.08
63.95 - 65.95       15             15/100 = 0.15         0.08 + 0.15 = 0.23
65.95 - 67.95       40             40/100 = 0.40         0.23 + 0.40 = 0.63
67.95 - 69.95       17             17/100 = 0.17         0.63 + 0.17 = 0.80
69.95 - 71.95       12             12/100 = 0.12         0.80 + 0.12 = 0.92
71.95 - 73.95       7              7/100 = 0.07          0.92 + 0.07 = 0.99
73.95 - 75.95       1              1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100    Total = 1.00

Table 1.7

The data in this table has been grouped into the following intervals: 

• 59.95 - 61.95 inches

• 61.95 - 63.95 inches

• 63.95 - 65.95 inches

• 65.95 - 67.95 inches

• 67.95 - 69.95 inches

• 69.95 - 71.95 inches

• 71.95 - 73.95 inches

• 73.95 - 75.95 inches



NOTE: This example is used again in the Descriptive Statistics (Section 2.1) chapter, where the 
method used to compute the intervals will be explained. 
In this sample, there are 5 players whose heights fall within the interval 59.95 - 61.95 inches, 3 players whose
heights fall within the interval 61.95 - 63.95 inches, 15 players whose heights fall within the interval 63.95 - 65.95
inches, 40 players whose heights fall within the interval 65.95 - 67.95 inches, 17 players whose heights
fall within the interval 67.95 - 69.95 inches, 12 players whose heights fall within the interval 69.95 - 71.95 inches,
7 players whose heights fall within the interval 71.95 - 73.95 inches, and 1 player whose height falls within the
interval 73.95 - 75.95 inches. All heights fall between the endpoints of an interval and not at the endpoints.

Example 1.8 

From the table, find the percentage of heights that are less than 65.95 inches. 

Solution 

If you look at the first, second, and third rows, the heights are all less than 65.95 inches. There are 
5 + 3 + 15 = 23 males whose heights are less than 65.95 inches. The percentage of heights less than 
65.95 inches is then 23/100, or 23%. This percentage is the cumulative relative frequency entry in the
third row. 



Example 1.9 

From the table, find the percentage of heights that fall between 61.95 and 65.95 inches. 

Solution 

Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 0.18 or 18%. 
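As a rough check on Examples 1.8 and 1.9, the same sums can be taken directly from the interval boundaries and frequencies of the soccer height table. The helper `count_between` is a hypothetical name introduced for this sketch; it assumes both arguments are boundaries that appear in the table.

```python
# Interval boundaries and frequencies from the soccer player height table.
boundaries = [59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95, 75.95]
frequencies = [5, 3, 15, 40, 17, 12, 7, 1]
n = sum(frequencies)   # 100 players


def count_between(lower, upper):
    """Number of heights between two boundaries listed in the table."""
    lo = boundaries.index(lower)
    hi = boundaries.index(upper)
    return sum(frequencies[lo:hi])


# Example 1.8: heights less than 65.95 inches (rows one through three).
pct_below = 100 * count_between(59.95, 65.95) / n

# Example 1.9: heights between 61.95 and 65.95 inches (rows two and three).
pct_between = 100 * count_between(61.95, 65.95) / n

print(pct_below, pct_between)
```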



Example 1.10 

Use the table of heights of the 100 male semiprofessional soccer players. Fill in the blanks and 
check your answers. 

1. The percentage of heights that are from 67.95 to 71.95 inches is: 

2. The percentage of heights that are from 67.95 to 73.95 inches is: 

3. The percentage of heights that are more than 65.95 inches is: 

4. The number of players in the sample who are between 61.95 and 71.95 inches tall is: 

5. What kind of data are the heights? 

6. Describe how you could gather this data (the heights) so that the data are characteristic of all 
male semiprofessional soccer players. 

Remember, you count frequencies. To find the relative frequency, divide the frequency by the 
total number of data values. To find the cumulative relative frequency, add all of the previous 
relative frequencies to the relative frequency for the current row. 



1.9.1 Optional Collaborative Classroom Exercise 

Exercise 1.9.1 

In your class, have someone conduct a survey of the number of siblings (brothers and sisters) each 
student has. Create a frequency table. Add to it a relative frequency column and a cumulative 
relative frequency column. Answer the following questions: 

1. What percentage of the students in your class has siblings? 

2. What percentage of the students has from 1 to 3 siblings? 
3. What percentage of the students has fewer than 3 siblings?

Example 1.11 

Nineteen people were asked how many miles, to the nearest mile, they commute to work each
day. The data are as follows: 

2; 5; 7; 3; 2; 10; 18; 15; 20; 7; 10; 18; 5; 12; 13; 12; 4; 5; 10 

The following table was produced: 

Frequency of Commuting Distances

DATA    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
3       3            3/19                  0.1579
4       1            1/19                  0.2105
5       3            3/19                  0.1579
7       2            2/19                  0.2632
10      3            4/19                  0.4737
12      2            2/19                  0.7895
13      1            1/19                  0.8421
15      1            1/19                  0.8948
18      1            1/19                  0.9474
20      1            1/19                  1.0000

Table 1.8



Problem (Solution on p. 40.)

1. Is the table correct? If it is not correct, what is wrong?

2. True or False: Three percent of the people surveyed commute 3 miles. If the statement is not 
correct, what should it be? If the table is incorrect, make the corrections. 

3. What fraction of the people surveyed commute 5 or 7 miles? 

4. What fraction of the people surveyed commute 12 miles or more? Less than 12 miles? Be- 
tween 5 and 13 miles (does not include 5 and 13 miles)? 



1.10 Summary 
Statistics

• Deals with the collection, analysis, interpretation, and presentation of data 

Probability 

• Mathematical tool used to study randomness 

Key Terms 

• Population 

• Parameter 

• Sample 

• Statistic 

• Variable 

• Data 

Types of Data 

• Quantitative Data (a number)

    • Discrete (You count it.)

    • Continuous (You measure it.)

• Qualitative Data (a category, words)

Sampling 

• With Replacement: A member of the population may be chosen more than once 

• Without Replacement: A member of the population may be chosen only once 

Random Sampling 

• Each member of the population has an equal chance of being selected 

Sampling Methods 

• Random

    • Simple random sample

    • Stratified sample

    • Cluster sample

    • Systematic sample

• Not Random

    • Convenience sample

Frequency (freq. or f) 

• The number of times an answer occurs 

Relative Frequency (rel. freq. or RF) 

• The proportion of times an answer occurs 

• Can be interpreted as a fraction, decimal, or percent 

Cumulative Relative Frequencies (cum. rel. freq. or cum RF) 

• An accumulation of the previous relative frequencies 

10 This content is available online at <http://cnx.org/content/m16023/1.10/>.

1.11 Practice: Sampling and Data 11
1.11.1 Student Learning Outcomes 

• The student will construct frequency tables. 

• The student will differentiate between key terms. 

• The student will compare sampling techniques. 



1.11.2 Given 

Studies are often done by pharmaceutical companies to determine the effectiveness of a treatment program. 
Suppose that a new AIDS antibody drug is currently under study. It is given to patients once the AIDS 
symptoms have revealed themselves. Of interest is the average length of time in months patients live once 
starting the treatment. Two researchers each follow a different set of 40 AIDS patients from the start of 
treatment until their deaths. The following data (in months) are collected. 

Researcher A 3; 4; 11; 15; 16; 17; 22; 44; 37; 16; 14; 24; 25; 15; 26; 27; 33; 29; 35; 44; 13; 21; 22; 10; 12; 8; 40; 32; 
26; 27; 31; 34; 29; 17; 8; 24; 18; 47; 33; 34 

Researcher B 3; 14; 11; 5; 16; 17; 28; 41; 31; 18; 14; 14; 26; 25; 21; 22; 31; 2; 35; 44; 23; 21; 21; 16; 12; 18; 41; 22; 
16; 25; 33; 34; 29; 13; 18; 24; 23; 42; 33; 29 

1.11.3 Organize the Data 

Complete the tables below using the data provided. 

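Researcher A

Survival Length (in months)    Frequency    Relative Frequency    Cumulative Relative Frequency
0.5 - 6.5                      ________     ________              ________
6.5 - 12.5                     ________     ________              ________
12.5 - 18.5                    ________     ________              ________
18.5 - 24.5                    ________     ________              ________
24.5 - 30.5                    ________     ________              ________
30.5 - 36.5                    ________     ________              ________
36.5 - 42.5                    ________     ________              ________
42.5 - 48.5                    ________     ________              ________

Table 1.9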
Researcher B

Survival Length (in months)    Frequency    Relative Frequency    Cumulative Relative Frequency
0.5 - 6.5                      ________     ________              ________
6.5 - 12.5                     ________     ________              ________
12.5 - 18.5                    ________     ________              ________
18.5 - 24.5                    ________     ________              ________
24.5 - 30.5                    ________     ________              ________
30.5 - 36.5                    ________     ________              ________
36.5 - 42.5                    ________     ________              ________
42.5 - 48.5                    ________     ________              ________

Table 1.10

11 This content is available online at <http://cnx.org/content/m16016/1.14/>.



1.11.4 Key Terms 

Define the key terms based upon the above example for Researcher A. 

Exercise 1.11.1 

Population 

Exercise 1.11.2 

Sample 

Exercise 1.11.3 

Parameter 

Exercise 1.11.4 

Statistic 

Exercise 1.11.5 

Variable 

Exercise 1.11.6 

Data 



1.11.5 Discussion Questions 

Discuss the following questions and then answer in complete sentences. 

Exercise 1.11.7 

List two reasons why the data may differ. 

Exercise 1.11.8 

Can you tell if one researcher is correct and the other one is incorrect? Why? 

Exercise 1.11.9 

Would you expect the data to be identical? Why or why not? 

Exercise 1.11.10 

How could the researchers gather random data? 

Exercise 1.11.11 

Suppose that the first researcher conducted his survey by randomly choosing one state in the 
nation and then randomly picking 40 patients from that state. What sampling method would that 
researcher have used? 
Exercise 1.11.12

Suppose that the second researcher conducted his survey by choosing 40 patients he knew. What 
sampling method would that researcher have used? What concerns would you have about this 
data set, based upon the data collection method? 
1.12 Homework Link 12

Link to homework questions in Homework Collection/Book for Sampling and Data Chapter 1 of Collaborative
Statistics for R. Bloom

Chapter 1 Homework Problems ( http://cnx.org/content/m18858/latest/ ) 13

12 This content is available online at <http://cnx.org/content/m19054/1.1/>.
13 http://cnx.org/content/m18858/latest/



40 CHAPTER 1. SAMPLING AND DATA 

Solutions to Exercises in Chapter 1 

Solution to Example 1.5, Problem (p. 22) 

Items 1, 5, 11, and 12 are quantitative discrete; items 4, 6, 10, and 14 are quantitative continuous; and items 
2, 3, 7, 8, 9, and 13 are qualitative. 
Solution to Example 1.10, Problem (p. 33) 

1. 29% 

2. 36% 

3. 77% 

4. 87 

5. quantitative continuous 

6. get rosters from each team and choose a simple random sample from each 

Solution to Example 1.11, Problem (p. 34) 

1. No. Frequency column sums to 18, not 19. Not all cumulative relative frequencies are correct. 

2. False. Frequency for 3 miles should be 1; for 2 miles (left out), 2. Cumulative relative frequency 
column should read: 0.1052, 0.1579, 0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895, 0.8421, 0.9474, 1. 

3. 5/19

4. 7/19, 12/19, 7/19



Chapter 2 



Descriptive Statistics 

2.1 Descriptive Statistics 1 

2.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Display data graphically and interpret graphs: stemplots, histograms and boxplots. 

• Recognize, describe, and calculate the measures of location of data: quartiles and percentiles. 

• Recognize, describe, and calculate the measures of the center of data: mean, median, and mode. 

• Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, 
and range. 

2.1.2 Introduction 

Once you have collected data, what will you do with it? Data can be described and presented in many 
different formats. For example, suppose you are interested in buying a house in a particular area. You may 
have no clue about the house prices, so you might ask your real estate agent to give you a sample data set 
of prices. Looking at all the prices in the sample often is overwhelming. A better way might be to look 
at the median price and the variation of prices. The median and variation are just two ways that you will 
learn to describe data. Your agent might also provide you with a graph of the data. 

In this chapter, you will study numerical and graphical ways to describe and display your data. This area 
of statistics is called "Descriptive Statistics". You will learn to calculate, and even more importantly, to 
interpret these measurements and graphs. 

2.2 Displaying Data 2 

A statistical graph is a tool that helps you learn about the shape or distribution of a sample. The graph can 
be a more effective way of presenting data than a mass of numbers because we can see where data clusters 
and where there are only a few data values. Newspapers and the Internet use graphs to show trends and 
to enable readers to compare facts and figures quickly. 

Statisticians often graph data first to get a picture of the data. Then, more formal tools may be applied. 



1 This content is available online at <http://cnx.org/content/m16300/1.9/>.
2 This content is available online at <http://cnx.org/content/m16297/1.9/>.
Some of the types of graphs that are used to summarize and organize data are the dot plot, the bar chart,
the histogram, the stem-and-leaf plot, the frequency polygon (a type of broken line graph), pie charts, and 
the boxplot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs and bar graphs. Our 
emphasis will be on histograms and boxplots. 

2.3 Stem and Leaf Graphs (Stemplots) 3 

One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data analysis. It
is a good choice when the data sets are small. To create the plot, divide each observation of data into a stem
and a leaf. The leaf consists of the final significant digit. For example, 23 has stem 2 and leaf 3. Four hundred
thirty-two (432) has stem 43 and leaf 2. Five thousand four hundred thirty-two (5,432) has stem 543 and leaf
2. The decimal 9.3 has stem 9 and leaf 3. Write the stems in a vertical line from smallest to largest. Draw a
vertical line to the right of the stems. Then write the leaves in increasing order next to their corresponding
stem.

Example 2.1 

For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to 
largest): 



33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94;
96; 100

Stem-and-Leaf Diagram

Stem    Leaf
3       3
4       2 9 9
5       3 5 5
6       1 3 7 8 8 9 9
7       2 3 4 8
8       0 3 8 8 8
9       0 2 4 4 4 4 6
10      0

Table 2.1

The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores or 
approximately 26% of the scores were in the 90's or 100, a fairly high number of As. 
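The construction can also be sketched in Python. The sketch below assumes whole-number data, so that the leaf is simply the last digit, and uses the 31 exam scores as they can be read off the stem-and-leaf diagram; `stemplot` is a hypothetical helper name, not a library function.

```python
from collections import defaultdict

scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]


def stemplot(values):
    """Return {stem: [leaves]} for whole numbers (leaf = final digit)."""
    plot = defaultdict(list)
    for v in sorted(values):          # sorting keeps leaves in increasing order
        stem, leaf = divmod(v, 10)    # 432 -> stem 43, leaf 2
        plot[stem].append(leaf)
    return dict(plot)


plot = stemplot(scores)
for stem in range(min(plot), max(plot) + 1):   # stems from smallest to largest
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} | {leaves}")

# Scores in the 90s or 100, as a share of all scores.
high = len(plot[9]) + len(plot[10])
print(high, len(scores), round(100 * high / len(scores)))
```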

The stemplot is a quick way to graph and gives an exact picture of the data. You want to look for an overall 
pattern and any outliers. An outlier is an observation of data that does not fit the rest of the data. It is 
sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the 
graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500) while others may 
indicate that something unusual is happening. It takes some background information to explain outliers. 
In the example above, there were no outliers. 

Example 2.2 

Create a stem plot using the data: 



3 This content is available online at <http://cnx.org/content/m16849/1.15/>.
1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3

The data are the distance (in kilometers) from a home to the nearest supermarket. 

Problem (Solution on p. 79.) 

1. Are there any outliers? 

2. Do the data seem to have any concentration of values? 

HINT: The leaves are to the right of the decimal. 



Another type of graph that is useful for specific data values is a line graph. In the particular line graph 
shown in the example, the x-axis consists of data values and the y-axis consists of frequency points. The 
frequency points are connected. 

Example 2.3 

In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do 
his/her chores. The results are shown in the table and the line graph. 



Number of times teenager is reminded    Frequency
0                                       2
1                                       5
2                                       8
3                                       14
4                                       7
5                                       4

Table 2.2



[Figure: line graph with Number of Times Teenager is Reminded on the x-axis and Frequency on the y-axis.]



Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they can be 
rectangular boxes and they can be vertical or horizontal. 

The bar graph shown in Example 2.4 has age groups represented on the x-axis and proportions on the y-axis.

Example 2.4

By the end of 2011, in the United States, Facebook had over 146 million users. The table
shows three age groups, the number of users in each age group and the proportion (%) of
users in each age group. Source:
http://www.kenburbary.com/2011/03/facebook-demographics-revisited-2011-statistics-2/



Age groups    Number of Facebook users    Proportion (%) of Facebook users
13-25         65,082,280                  45%
26-44         53,300,200                  36%
45-64         27,885,100                  19%

Table 2.3




Example 2.5 

The columns in the table below contain the race/ethnicity of U.S. Public Schools: High School
Class of 2011, percentages for the Advanced Placement Examinee Population for that class 
and percentages for the Overall Student Population. The 3-dimensional graph shows the 
Race/Ethnicity of U.S. Public Schools (qualitative data) on the x-axis and Advanced Placement 
Examinee Population percentages on the y-axis. (Source: http://www.collegeboard.com and 
Source: http://apreport.collegeboard.org/goals-and-findings/promoting-equity) 



Race/Ethnicity                                    AP Examinee Population    Overall Student Population
1 = Asian, Asian American or Pacific Islander     10.3%                     5.7%
2 = Black or African American                     9.0%                      14.7%
3 = Hispanic or Latino                            17.0%                     17.6%
4 = American Indian or Alaska Native              0.6%                      1.1%
5 = White                                         57.1%                     59.2%
6 = Not reported/other                            6.0%                      1.7%

Table 2.4



[Figure: 3-dimensional bar graph titled "Ethnicity/Race vs. Percent of AP Examinees".]


Go to Outcomes of Education Figure 22 4 for an example of a bar graph that shows unemployment rates of 
persons 25 years and older for 2009. 

NOTE: This book contains instructions for constructing a histogram and a box plot for the TI-83+ 
and TI-84 calculators. You can find additional instructions for using these calculators on the Texas 
Instruments (TI) website 5 . 



2.4 Histograms 6 

For most of the work you do in this book, you will use a histogram to display the data. One advantage of a 
histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data 
set consists of 100 values or more. 

A histogram consists of contiguous boxes. It has both a horizontal axis and a vertical axis. The horizontal 
axis is labeled with what the data represents (for instance, distance from your home to school). The vertical 
axis is labeled either Frequency or relative frequency. The graph will have the same shape with either 
label. The histogram (like the stemplot) can give you the shape of the data, the center, and the spread of the 
data. (The next section tells you how to calculate the center and the spread.) 



4 http://nces.ed.gov/pubs2011/2011015_5.pdf 

5 http://education.ti.com/educationportal/sites/US/sectionHome/support.html

6 This content is available online at <http://cnx.org/content/m16298/1.13/>.
The relative frequency is equal to the frequency for an observed value of the data divided by the total
number of data values in the sample. (In the chapter on Sampling and Data (Section 1.1), we defined 
frequency as the number of times an answer occurs.) If: 



$f$ = frequency

$n$ = total number of data values (or the sum of the individual frequencies), and

$RF$ = relative frequency,

then:

$$RF = \frac{f}{n} \qquad (2.1)$$

For example, if 3 students in Mr. Ahab's English class of 40 students received from 90% to 100%, then
$f = 3$, $n = 40$, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$.

Seven and a half percent of the students received 90% to 100%. Ninety percent to 100% are quantitative
measures.

To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. 
Many histograms consist of from 5 to 15 bars or classes for clarity. Choose a starting point for the first 
interval to be less than the smallest data value. A convenient starting point is a lower value carried out 
to one more decimal place than the value with the most decimal places. For example, if the value with the 
most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 0.05 = 6.05). 
We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value 
is 1.5, a convenient starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most decimal places is 
3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data
happen to be integers and the smallest value is 2, then a convenient starting point is 1.5 (2 - 0.5 = 1.5). Also, 
when the starting point and other boundaries are carried to one additional decimal place, no data value 
will fall on a boundary. 
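The rule for a convenient starting point can be written as a small function. This is a sketch under stated assumptions: values are passed as strings so that trailing zeros such as "1.0" are preserved, and `starting_point` is a hypothetical name used only for illustration.

```python
from decimal import Decimal


def starting_point(smallest, most_precise):
    """Subtract half a unit in the last decimal place of the most precise value.

    Both arguments are decimal strings, for example "1.5" and "2.23".
    """
    places = -Decimal(most_precise).as_tuple().exponent   # number of decimal places
    half_unit = Decimal(5).scaleb(-(places + 1))          # 0.5, 0.05, 0.005, and so on
    return Decimal(smallest) - half_unit


# The four cases worked in the paragraph above.
print(starting_point("6.1", "6.1"))     # smallest value 6.1, one decimal place
print(starting_point("1.5", "2.23"))    # most precise value has two decimal places
print(starting_point("1.0", "3.234"))   # most precise value has three decimal places
print(starting_point("2", "2"))         # integer data
```

Working in `Decimal` rather than binary floating point keeps results such as 1.495 exact, which matters when the point of the rule is that no data value lands on a boundary.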

Example 2.6 

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional 
soccer players. The heights are continuous data since height is measured. 

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

74



The smallest data value is 60. Since the data with the most decimal places has one decimal (for 
instance, 61.5), we want our starting point to have two decimal places. Since the numbers 0.5, 
0.05, 0.005, etc. are convenient numbers, use 0.05 and subtract it from 60, the smallest value, for 
the convenient starting point. 
60 - 0.05 = 59.95 which is more precise than, say, 61.5 by one decimal place. The starting point is,
then, 59.95.

The largest value is 74. 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting 
point from the ending value and divide by the number of bars (you must choose the number of 
bars you desire). Suppose you choose 8 bars. 

$$\frac{74.05 - 59.95}{8} = 1.76 \qquad (2.2)$$



NOTE: We will round up to 2 and make each bar or class interval 2 units wide. Rounding up to 2 is 
one way to prevent a value from falling on a boundary. Rounding to the next number is necessary 
even if it goes against the standard rules of rounding. For this example, using 1.76 as the width 
would also work. 

The boundaries are: 

59.95 

59.95 + 2 = 61.95 
61.95 + 2 = 63.95 
63.95 + 2 = 65.95 
65.95 + 2 = 67.95 
67.95 + 2 = 69.95 
69.95 + 2 = 71.95 
71.95 + 2 = 73.95 
73.95 + 2 = 75.95 

The heights 60 through 61.5 inches are in the interval 59.95 - 61.95. The heights that are 63.5 are 
in the interval 61.95 - 63.95. The heights that are 64 through 64.5 are in the interval 63.95 - 65.95. 
The heights 66 through 67.5 are in the interval 65.95 - 67.95. The heights 68 through 69.5 are in the 
interval 67.95 - 69.95. The heights 70 through 71 are in the interval 69.95 - 71.95. The heights 72 
through 73.5 are in the interval 71.95 - 73.95. The height 74 is in the interval 73.95 - 75.95. 
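The width calculation, the boundaries, and the interval counts above can be checked with a short sketch. The `heights` list is an assumption of this sketch: it is rebuilt from the interval frequencies of the soccer height table (5, 3, 15, 40, 17, 12, 7, 1) and the listed data, and the variable names are illustrative.

```python
import math
from decimal import Decimal

start, end, bars = Decimal("59.95"), Decimal("74.05"), 8

raw_width = (end - start) / bars   # 1.7625
width = math.ceil(raw_width)       # always round up to the next whole number: 2

# bars + 1 boundaries, each one class width above the previous one.
boundaries = [start + i * width for i in range(bars + 1)]

# The 100 heights, written compactly as value * count (assumed reconstruction).
heights = ([60, 60.5, 61, 61, 61.5] + [63.5] * 3 + [64] * 7 + [64.5] * 8
           + [66] * 10 + [66.5] * 11 + [67] * 12 + [67.5] * 7
           + [68] * 2 + [69] * 10 + [69.5] * 5
           + [70] * 6 + [70.5] * 3 + [71] * 3
           + [72] * 3 + [72.5] * 2 + [73, 73.5] + [74])

# Count how many heights fall in each class interval.
counts = [0] * bars
for h in heights:
    index = int((Decimal(str(h)) - start) // width)
    counts[index] += 1

for lower, upper, count in zip(boundaries, boundaries[1:], counts):
    print(f"{lower} - {upper}: {count}")
```

Because every boundary carries one more decimal place than the data, no height can equal a boundary, so the floor division never has to break a tie.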

The following histogram displays the heights on the x-axis and relative frequency on the y-axis. 
[Figure: histogram with Heights on the x-axis (boundaries 59.95, 61.95, 63.95, 65.95, 67.95, 69.95, 71.95, 73.95,
75.95) and Relative Frequency on the y-axis; the bar heights are 0.05, 0.03, 0.15, 0.4, 0.17, 0.12, 0.07, and 0.01.]



Example 2.7 

The following data are the number of books bought by 50 part-time college students at ABC 
College. The number of books is discrete data since books are counted. 

1; 1; 1; 1; 1; 1; 1; 1; 1; 1; 1

2; 2; 2; 2; 2; 2; 2; 2; 2; 2

3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3

4; 4; 4; 4; 4; 4

5; 5; 5; 5; 5

6; 6

Eleven students buy 1 book. Ten students buy 2 books. Sixteen students buy 3 books. Six students 
buy 4 books. Five students buy 5 books. Two students buy 6 books. 

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the 
largest data value. Then the starting point is 0.5 and the ending value is 6.5. 

Problem (Solution on p. 79.) 

Next, calculate the width of each bar or class interval. If the data are discrete and there are not too 
many different values, a width that places the data values in the middle of the bar or class interval 
is the most convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6 and the starting point is 
0.5, a width of one places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of 
the interval from 1.5 to 2.5, the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of
the interval from _______ to _______, the 5 in the middle of the interval from _______ to _______,
and the _______ in the middle of the interval from _______ to _______.
Calculate the number of bars as follows:

$$\frac{6.5 - 0.5}{\text{bars}} = 1 \qquad (2.3)$$

where 1 is the width of a bar. Therefore, bars = 6.

The following histogram displays the number of books on the x-axis and the frequency on the 
y-axis. 



[Figure: histogram with Number of Books on the x-axis (boundaries 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5) and
Frequency on the y-axis.]



2.4.1 Optional Collaborative Exercise 

Count the money (bills and change) in your pocket or purse. Your instructor will record the amounts. As a 
class, construct a histogram displaying the data. Discuss how many intervals you think is appropriate. You 
may want to experiment with the number of intervals. Discuss, also, the shape of the histogram. 

Record the data, in dollars (for example, 1.25 dollars). 

Construct a histogram. 

2.5 Box Plots 7 

Box plots or box-whisker plots give a good graphical image of the concentration of the data. They also 
show how far from most of the data the extreme values are. The box plot is constructed from five values: 
the smallest value, the first quartile, the median, the third quartile, and the largest value. The median, the 
first quartile, and the third quartile will be discussed here, and then again in the section on measuring data 
in this chapter. We use these values to compare how close other data values are to them. 

The median, a number, is a way of measuring the "center" of the data. You can think of the median as the 
"middle value," although it does not actually have to be one of the observed values. It is a number that 
separates ordered data into halves. Half the values are the same number or smaller than the median and 
half the values are the same number or larger. For example, consider the following data: 



7 This content is available online at <http://cnx.org/content/m16296/1.12/>.



50 CHAPTER 2. DESCRIPTIVE STATISTICS 

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1 

Ordered from smallest to largest: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5 

The median is between the 7th value, 6.8, and the 8th value 7.2. To find the median, add the two values 
together and divide by 2. 

$$\frac{6.8 + 7.2}{2} = 7 \qquad (2.4)$$

The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7. 

Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. 
To find the quartiles, first find the median or second quartile. The first quartile is the middle value of the 
lower half of the data and the third quartile is the middle value of the upper half of the data. To get the 
idea, consider the same data set shown above: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5 

The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of the 
lower half is 2. 

1; 1; 2; 2; 4; 6; 6.8 

The number 2, which is part of the data, is the first quartile. One-fourth of the values are the same or less 
than 2 and three-fourths of the values are more than 2. 

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is 9. 

7.2; 8; 8.3; 9; 10; 10; 11.5 

The number 9, which is part of the data, is the third quartile. Three-fourths of the values are less than 9 
and one-fourth of the values are more than 9. 
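The median-of-halves procedure described above can be sketched as follows. The function name `quartiles` is hypothetical, and leaving the middle value out of both halves when the number of values is odd is an assumption of this sketch; calculators and software packages use several slightly different quartile conventions, so their results may differ from this method.

```python
from statistics import median

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumes at least two values. With an odd count, the middle value
    is left out of both halves (an assumption of this sketch).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]     # lower half of the ordered data
    upper = ordered[-half:]    # upper half of the ordered data
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```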

To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest data 
values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile 
marks the other end of the box. The middle fifty percent of the data fall inside the box. The "whiskers" 
extend from the ends of the box to the smallest and largest data values. The box plot gives a good quick 
picture of the data. 

NOTE: You may encounter box and whisker plots that have dots marking outlier values. In those
cases, the whiskers do not extend to the minimum and maximum values.

Consider the following data: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5

The first quartile is 2, the median is 7, and the third quartile is 9. The smallest value is 1 and the largest 
value is 11.5. The box plot is constructed as follows (see calculator instructions in the back of this book or 
on the TI web site 8 ): 



8 http://education.ti.com/educationportal/sites/US/sectionHome/support.html 
[Figure: box plot of the data above on a horizontal number line.]



The two whiskers extend from the first quartile to the smallest value and from the third quartile to the 
largest value. The median is shown with a dashed line. 

Example 2.8 

The following data are the heights of 40 students in a statistics class. 

59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 66; 67; 67; 68; 68; 69; 70; 70; 70; 
70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77 

Construct a box plot with the following properties: 

• Smallest value = 59 

• Largest value = 77 

• Q1: First quartile = 64.5

• Q2: Second quartile or median = 66

• Q3: Third quartile = 70 



[Figure: box plot with smallest value 59, Q1 = 64.5, median = 66, Q3 = 70, and largest value 77.]



a. Each quarter has 25% of the data. 

b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66 - 64.5 = 1.5 (second quarter),
70 - 66 = 4 (third quarter), and 77 - 70 = 7 (fourth quarter). So, the second quarter has the
smallest spread and the fourth quarter has the largest spread.

c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5.

d. The interval 59 through 65 has more than 25% of the data, so it has more data in it than the
interval 66 through 70, which has 25% of the data.

e. The middle 50% (middle half) of the data has a range of 5.5 inches. 

For some sets of data, some of the largest value, smallest value, first quartile, median, and third 
quartile may be the same. For instance, you might have a data set in which the median and the 
third quartile are the same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the third quartile and the 
median. For example, if the smallest value and the first quartile were both 1, the median and the 
third quartile were both 5, and the largest value was 7, the box plot would look as follows: 

Example 2.9

Test scores for a college statistics class held during the day are: 

99; 56; 78; 55.5; 32; 90; 80; 81; 56; 59; 45; 77; 84.5; 84; 70; 72; 68; 32; 79; 90 

Test scores for a college statistics class held during the evening are: 

98; 78; 68; 83; 81; 89; 88; 76; 65; 45; 98; 90; 80; 84.5; 85; 79; 78; 98; 90; 79; 81; 25.5 

Problem (Solution on p. 79.) 



• What are the smallest and largest data values for each data set?

• What are the median, the first quartile, and the third quartile for each data set?

• Create a boxplot for each set of data.

• Which boxplot has the widest spread for the middle 50% of the data (the data between the
first and third quartiles)? What does this mean for that set of data in comparison to the other
set of data?

• For each data set, what percent of the data is between the smallest value and the first quartile?
(Answer: 25%) the first quartile and the median? (Answer: 25%) the median and the
third quartile? the third quartile and the largest value? What percent of the data is between
the first quartile and the largest value? (Answer: 75%)

The first data set (the top box plot) has the widest spread for the middle 50% of the data. IQR =
Q3 - Q1 is 82.5 - 56 = 26.5 for the first data set and 89 - 78 = 11 for the second data set.
So, the first set of data has its middle 50% of scores more spread out.

25% of the data is between M and Q3 and 25% is between Q3 and Xmax. 



2.6 Measures of the Location of the Data 9 

The common measures of location are quartiles and percentiles (%iles). Quartiles are special percentiles.
The first quartile, Q1, is the same as the 25th percentile (25th %ile) and the third quartile, Q3, is the same as
the 75th percentile (75th %ile). The median, M, is called both the second quartile and the 50th percentile
(50th %ile).

To calculate quartiles and percentiles, the data must be ordered from smallest to largest. Recall that 
quartiles divide ordered data into quarters. Percentiles divide ordered data into hundredths. To score in 
the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test. It means that 
90% of test scores are the same or less than your score and 10% of the test scores are the same or greater 
than your test score. 



9 This content is available online at <http://cnx.org/content/m16314/1.17/>.

Percentiles are useful for comparing values. For this reason, universities and colleges use percentiles
extensively. 

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of 
the test scores are less (and not the same or less) than your score, it would be acceptable because removing 
one particular data value is not significant. 

The interquartile range is a number that indicates the spread of the middle half or the middle 50% of the
data. It is the difference between the third quartile (Q3) and the first quartile (Q1).

IQR = Q3 - Q1 (2.5)

The IQR can help to determine potential outliers. A value is suspected to be a potential outlier if it is
more than (1.5)(IQR) below the first quartile or more than (1.5)(IQR) above the third quartile, that is, if it
is less than Q1 - (1.5)(IQR) or greater than Q3 + (1.5)(IQR). Potential outliers always need further
investigation.

Example 2.10 

For the following 13 real estate prices, calculate the IQR and determine if any prices are outliers. 
Prices are in dollars. (Source: San Jose Mercury News) 

389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000; 575,000; 
488,800; 1,095,000 

Solution 

Order the data from smallest to largest. 

114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 575,000; 639,000; 659,000; 
1,095,000; 5,500,000 

M = 488,800

\[ Q_1 = \frac{230500 + 387000}{2} = 308750 \]

\[ Q_3 = \frac{639000 + 659000}{2} = 649000 \]

IQR = 649000 - 308750 = 340250

(1.5)(IQR) = (1.5)(340250) = 510375

Q1 - (1.5)(IQR) = 308750 - 510375 = -201625

Q3 + (1.5)(IQR) = 649000 + 510375 = 1159375

No house price is less than -201625. However, 5,500,000 is more than 1,159,375. Therefore, 
5,500,000 is a potential outlier. 
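
The outlier check in this example can also be sketched in Python. This is an illustrative sketch rather than part of the original solution; the variable names (lower_fence, upper_fence, and so on) are our own, and the quartiles are found as the middle values of the lower and upper halves, as in the solution above.

```python
from statistics import median

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

data = sorted(prices)
n = len(data)
q1 = median(data[:n // 2])            # middle of the lower half
q3 = median(data[n // 2 + n % 2:])    # middle of the upper half
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr          # values below this are potential outliers
upper_fence = q3 + 1.5 * iqr          # values above this are potential outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Only the 5,500,000 price falls outside the two limits, which agrees with the hand calculation.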



Example 2.11 

For the two data sets in the test scores example (p. 52), find the following: 

a. The interquartile range. Compare the two interquartile ranges. 

b. Any outliers in either set. 

c. The 30th percentile and the 80th percentile for each set. How much data falls below the 30th
percentile? Above the 80th percentile?

Example 2.12: Finding Quartiles and Percentiles Using a Table

Fifty statistics students were asked how much sleep they get per school night (rounded to the 
nearest hour). The results were (student data): 



AMOUNT OF SLEEP PER                  RELATIVE     CUMULATIVE RELATIVE
SCHOOL NIGHT (HOURS)    FREQUENCY    FREQUENCY    FREQUENCY
4                       2            0.04         0.04
5                       5            0.10         0.14
6                       7            0.14         0.28
7                       12           0.24         0.52
8                       14           0.28         0.80
9                       7            0.14         0.94
10                      3            0.06         1.00



Table 2.5 

Find the 28th percentile: Notice the 0.28 in the "cumulative relative frequency" column. 28% of 50 
data values = 14. There are 14 values less than the 28th %ile. They include the two 4s, the five 5s, 
and the seven 6s. The 28th %ile is between the last 6 and the first 7. The 28th %ile is 6.5. 

Find the median: Look again at the "cumulative relative frequency" column and find 0.52. The
median is the 50th %ile or the second quartile. 50% of 50 = 25. There are 25 values less than the 
median. They include the two 4s, the five 5s, the seven 6s, and eleven of the 7s. The median or 
50th %ile is between the 25th (7) and 26th (7) values. The median is 7. 

Find the third quartile: The third quartile is the same as the 75th percentile. You can "eyeball" this 
answer. If you look at the "cumulative relative frequency" column, you find 0.52 and 0.80. When 
you have all the 4s, 5s, 6s and 7s, you have 52% of the data. When you include all the 8s, you have 
80% of the data. The 75th %ile, then, must be an 8. Another way to look at the problem is to find
75% of 50 (= 37.5) and round up to 38. The third quartile, Q3, is the 38th value which is an 8. You 
can check this answer by counting the values. (There are 37 values below the third quartile and 12 
values above.) 
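
The counting rule used in this example can be sketched in Python. This is a hedged illustration, not a standard library routine: the function name percentile_from_table is our own, and it assumes the rule used above (if p% of n is a whole number, average that value and the next one; otherwise round the position up).

```python
def percentile_from_table(freq_table, p):
    """p-th percentile (integer p, 0 < p < 100) using the counting rule of Example 2.12."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    # Expand the frequency table into an ordered list of individual data values.
    data = [value for value, count in sorted(freq_table.items()) for _ in range(count)]
    n = len(data)
    position, remainder = divmod(p * n, 100)   # integer arithmetic avoids round-off
    if remainder == 0:
        # Exactly `position` values lie below: average that value and the next one.
        return (data[position - 1] + data[position]) / 2
    # Otherwise round the position up and take that value.
    return data[position]

sleep_hours = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}   # hours -> frequency
for p in (28, 50, 75):
    print(p, percentile_from_table(sleep_hours, p))
```

For the sleep data, the sketch should agree with the values found above: 6.5 for the 28th percentile, 7 for the median, and 8 for the third quartile.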

Example 2.13 

Using the table: 

1. Find the 80th percentile. 

2. Find the 90th percentile. 

3. Find the first quartile. What is another name for the first quartile? 

4. Construct a box plot of the data. 



Collaborative Classroom Exercise: Your instructor or a member of the class will ask everyone in class how 
many sweaters they own. Answer the following questions. 



1. How many students were surveyed? 

2. What kind of sampling did you do? 

3. Find the mean and standard deviation.

4. Find the mode. 

5. Construct 2 different histograms. For each, starting value = _____ ending value = _____.

6. Find the median, first quartile, and third quartile. 

7. Construct a box plot. 

8. Construct a table of the data to find the following: 

• The 10th percentile 

• The 70th percentile 

• The percent of students who own less than 4 sweaters 

Interpreting Percentiles, Quartiles, and Median 

A percentile indicates the relative standing of a data value when data are sorted into numerical order, from 
smallest to largest. p% of data values are less than or equal to the pth percentile. For example, 15% of data 
values are less than or equal to the 15th percentile. 

• Low percentiles always correspond to lower data values. 

• High percentiles always correspond to higher data values. 

A percentile may or may not correspond to a value judgment about whether it is "good" or "bad". The 
interpretation of whether a certain percentile is good or bad depends on the context of the situation to 
which the data applies. In some situations, a low percentile would be considered "good"; in other contexts
a high percentile might be considered "good". In many situations, there is no value judgment that applies. 

Understanding how to properly interpret percentiles is important not only when describing data, 
but is also important in later chapters of this textbook when calculating probabilities. 

Guideline: 

When writing the interpretation of a percentile in the context of the given data, the sentence should 
contain the following information: 

• information about the context of the situation being considered, 

• the data value (value of the variable) that represents the percentile, 

• the percent of individuals or items with data values below the percentile. 

• Additionally, you may also choose to state the percent of individuals or items with data values above 
the percentile. 

Example 2.14 

On a timed math test, the first quartile for times for finishing the exam was 35 minutes. Interpret 
the first quartile in the context of this situation. 

• 25% of students finished the exam in 35 minutes or less. 

• 75% of students finished the exam in 35 minutes or more. 

• A low percentile could be considered good, as finishing more quickly on a timed exam is 
desirable. (If you take too long, you might not be able to finish.) 

Example 2.15 

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret 
the 70th percentile in the context of this situation. 

• 70% of students answered 16 or fewer questions correctly. 

• 30% of students answered 16 or more questions correctly. 

• Note: A high percentile could be considered good, as answering more questions correctly is 
desirable. 

Example 2.16

At a certain community college, it was found that the 30th percentile of credit units that students 
are enrolled for is 7 units. Interpret the 30th percentile in the context of this situation. 

• 30% of students are enrolled in 7 or fewer credit units 

• 70% of students are enrolled in 7 or more credit units 

• In this example, there is no "good" or "bad" value judgment associated with a higher or 
lower percentile. Students attend community college for varied reasons and needs, and their 
course load varies according to their needs. 



Do the following Practice Problems for Interpreting Percentiles 

Exercise 2.6.1 (Solution on p. 80.) 

a. For runners in a race, a low time means a faster run. The winners in a race have the shortest
running times. Is it more desirable to have a finish time with a high or a low percentile when
running a race?

b. The 20th percentile of run times in a particular race is 5.2 minutes. Write a sentence interpreting
the 20th percentile in the context of the situation.

c. A bicyclist in the 90th percentile of a bicycle race between two towns completed the race in 1
hour and 12 minutes. Is he among the fastest or slowest cyclists in the race? Write a sentence
interpreting the 90th percentile in the context of the situation.

Exercise 2.6.2 (Solution on p. 81.) 

a. For runners in a race, a higher speed means a faster run. Is it more desirable to have a speed
with a high or a low percentile when running a race?

b. The 40th percentile of speeds in a particular race is 7.5 miles per hour. Write a sentence
interpreting the 40th percentile in the context of the situation.

Exercise 2.6.3 (Solution on p. 81.) 

On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain. 

Exercise 2.6.4 (Solution on p. 81.) 

Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes 
is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th 
percentile in the context of this situation. 

Exercise 2.6.5 (Solution on p. 81.) 

In a survey collecting data about the salaries earned by recent college graduates, Li found that her 
salary was in the 78th percentile. Should Li be pleased or upset by this result? Explain. 

Exercise 2.6.6 (Solution on p. 81.) 

In a study collecting data about the repair costs of damage to automobiles in a certain type of 
crash tests, a certain model of car had $1700 in damage and was in the 90th percentile. Should the 
manufacturer and/or a consumer be pleased or upset by this result? Explain. Write a sentence
that interprets the 90th percentile in the context of this problem.

Exercise 2.6.7 (Solution on p. 81.)

The University of California has two criteria used to set admission standards for freshmen to be
admitted to a college in the UC system: 

a. Students' GPAs and scores on standardized tests (SATs and ACTs) are entered into a formula
that calculates an "admissions index" score. The admissions index score is used to set eligibility
standards intended to meet the goal of admitting the top 12% of high school students
in the state. In this context, what percentile does the top 12% represent?



b. Students whose GPAs are at or above the 96th percentile of all students at their high school 
are eligible (called eligible in the local context), even if they are not in the top 12% of all 
students in the state. What percent of students from each high school are "eligible in the 
local context"? 

Exercise 2.6.8 (Solution on p. 81.) 

Suppose that you are buying a house. You and your realtor have determined that the most expensive
house you can afford is the 34th percentile. The 34th percentile of housing prices is $240,000
in the town you want to move to. In this town, can you afford 34% of the houses or 66% of the
houses?

**With contributions from Roberta Bloom 

2.7 Measures of the Center of the Data 10 

The "center" of a data set is also a way of describing location. The two most widely used measures of the 
"center" of the data are the mean (average) and the median. To calculate the mean weight of 50 people, 
add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data 
and find the number that splits the data into two equal parts (previously discussed under box plots in this 
chapter). The median is generally a better measure of the center when there are extreme values or outliers 
because it is not affected by the precise numerical values of the outliers. The mean is the most common 
measure of the center. 

NOTE: The words "mean" and "average" are often used interchangeably. The substitution of one 
word for the other is common practice. The technical term is "arithmetic mean" and "average" is 
technically a center location. However, in practice among non-statisticians, "average" is commonly 
accepted for "arithmetic mean." 

The mean can also be calculated by multiplying each distinct value by its frequency and then dividing the 
sum by the total number of data values. The letter used to represent the sample mean is an x with a bar 
over it (pronounced "x bar"): $\bar{x}$.

The Greek letter $\mu$ (pronounced "mew") represents the population mean. One of the requirements for the
sample mean to be a good estimate of the population mean is for the sample taken to be truly random. 

To see that both ways of calculating the mean are the same, consider the sample: 

1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4 

\[ \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = 2.7 \qquad (2.6) \]

\[ \frac{3 \times 1 + 2 \times 2 + 1 \times 3 + 5 \times 4}{11} = 2.7 \qquad (2.7) \]

In the second example, the frequencies are 3, 2, 1, and 5. 
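
To see the two calculations side by side, here is a short Python sketch. It is only an illustration; the variable names are our own, and Counter from the standard library is used to tally the frequencies.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# First way: add every data value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Second way: multiply each distinct value by its frequency, then divide by n.
frequencies = Counter(sample)
mean_weighted = sum(value * f for value, f in frequencies.items()) / sum(frequencies.values())

print(round(mean_direct, 1), round(mean_weighted, 1))
```

Both calculations use the same total (30) and the same number of data values (11), so they give the same mean.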

You can quickly find the location of the median by using the expression $\frac{n+1}{2}$.

The letter n is the total number of data values in the sample. If n is an odd number, the median is the middle 
value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal to the 
two middle values added together and divided by 2 after the data has been ordered. For example, if the 
total number of data values is 97, then $\frac{n+1}{2} = \frac{97+1}{2} = 49$. The median is the 49th value in the ordered data.



"This content is available online at <http://cnx.Org/content/ml7102/l.ll/>. 
If the total number of data values is 100, then $\frac{n+1}{2} = \frac{100+1}{2} = 50.5$. The median occurs midway between the
50th and 51st values. The location of the median and the value of the median are not the same. The upper 
case letter M is often used to represent the median. The next example illustrates the location of the median 
and the value of the median. 

Example 2.17 

AIDS data indicating the number of months an AIDS patient lives after taking a new antibody 
drug are as follows (smallest to largest): 

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 
33; 33; 34; 34; 35; 37; 40; 44; 44; 47 

Calculate the mean and the median. 

Solution 

The calculation for the mean is: 

\[ \bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\dots+35+37+40+(44)(2)+47}{40} = 23.6 \]

To find the median, M, first use the formula for the location. The location is:

\[ \frac{n+1}{2} = \frac{40+1}{2} = 20.5 \]

Starting at the smallest value, the median is located between the 20th and 21st values (the two 
24s): 

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 
33; 33; 34; 34; 35; 37; 40; 44; 44; 47 

\[ M = \frac{24+24}{2} = 24 \]

The median is 24. 
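
The difference between the location of the median and the value of the median can be sketched in Python. This is an illustration under our own naming; positions are counted from 1, as in the text, while Python lists are indexed from 0.

```python
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22, 22, 24,
          24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35, 37, 40, 44, 44, 47]

data = sorted(months)
n = len(data)
location = (n + 1) / 2                  # position of the median, counting from 1

if n % 2 == 1:
    median = data[int(location) - 1]    # odd n: the single middle value
else:
    below = data[n // 2 - 1]            # value just below the location
    above = data[n // 2]                # value just above the location
    median = (below + above) / 2        # even n: average the two middle values

mean = sum(data) / n
print(location, median, round(mean, 1))
```

Here the location is 20.5, while the value of the median is 24, the average of the 20th and 21st values.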



Example 2.18 

Suppose that, in a small town of 50 people, one person earns $5,000,000 per year and the other 49 
each earn $30,000. Which is the better measure of the "center," the mean or the median? 

Solution 

\[ \bar{x} = \frac{5000000 + 49 \times 30000}{50} = 129400 \]

M = 30000 

(There are 49 people who earn $30,000 and one person who earns $5,000,000.) 

The median is a better measure of the "center" than the mean because 49 of the values are 30,000 
and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle of 
the data. 



Another measure of the center is the mode. The mode is the most frequent value. If two values
tie for the highest frequency, then the data set is bimodal.



Example 2.19

Statistics exam scores for 20 students are as follows:

50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93

Problem 

Find the mode. 

Solution 

The most frequent score is 72, which occurs five times. Mode = 72. 



Example 2.20 

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 
430 and 480 each occur twice. 

When is the mode the best measure of the "center"? Consider a weight loss program that advertises 
an average weight loss of six pounds the first week of the program. The mode might indicate that 
most people lose two pounds the first week, making the program less appealing. 

NOTE: The mode can be calculated for qualitative data as well as for quantitative data. 

Statistical software will easily calculate the mean, the median, and the mode. Some graphing 
calculators can also make these calculations. In the real world, people make these calculations 
using software. 
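
As one hedged example of such software, Python's standard statistics module provides mean, median, and multimode (multimode requires Python 3.8 or later). The list of colors below is hypothetical data that we made up to illustrate a mode for qualitative data; the two score lists come from Examples 2.19 and 2.20.

```python
from statistics import mean, median, multimode

exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72,
               76, 78, 81, 83, 84, 84, 84, 90, 93]
real_estate_scores = [430, 430, 480, 480, 495]
colors = ["red", "blue", "red", "green"]      # hypothetical qualitative data

print(mean(exam_scores), median(exam_scores), multimode(exam_scores))
print(multimode(real_estate_scores))          # two values tie, so the set is bimodal
print(multimode(colors))                      # the mode also works for qualitative data
```

multimode returns a list so that ties, such as the two real estate exam scores, are all reported.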



2.7.1 The Law of Large Numbers and the Mean 

The Law of Large Numbers says that if you take samples of larger and larger size from any population,
then the mean $\bar{x}$ of the sample is very likely to get closer and closer to $\mu$. This is discussed in more detail in
The Central Limit Theorem.
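
A small simulation can make the Law of Large Numbers concrete. The sketch below is our own illustration, not part of the original text: it assumes the population is the set of rolls of a fair six-sided die, whose mean is 3.5, and it uses a fixed seed only so that the run is repeatable.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
population_mean = 3.5           # mean of a fair six-sided die: (1+2+3+4+5+6)/6

errors = {}
for n in (10, 1_000, 100_000):
    rolls = [rng.randint(1, 6) for _ in range(n)]   # a sample of size n
    sample_mean = sum(rolls) / n
    errors[n] = abs(sample_mean - population_mean)  # distance from the population mean
    print(n, round(sample_mean, 3))
```

With larger samples, the sample mean is very likely, though not guaranteed, to land closer to 3.5.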

NOTE: The formula for the mean is located in the Summary of Formulas (Section 2.10).



2.7.2 Sampling Distributions and Statistic of a Sampling Distribution 

You can think of a sampling distribution as a relative frequency distribution with a great many samples. 
(See Sampling and Data for a review of relative frequency). Suppose thirty randomly selected students 
were asked the number of movies they watched the previous week. The results are in the relative frequency 
table shown below. 



# of movies    Relative Frequency
0              5/30
1              15/30
2              6/30
3              4/30
4              1/30



Table 2.6 

If you let the number of samples get very large (say, 300 million or more), the relative frequency table
becomes a relative frequency distribution. 

A statistic is a number calculated from a sample. Statistic examples include the mean, the median and the
mode as well as others. The sample mean $\bar{x}$ is an example of a statistic which estimates the population
mean $\mu$.

2.8 Skewness and the Mean, Median, and Mode 11 

Consider the following data set: 

4;5;6;6;6;7;7;7;7;7;7;8;8;8;9;10 

This data set produces the histogram shown below. Each interval has width one and each value is located 
in the middle of an interval. 



[Figure: histogram of the data set above.]



The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line 
can be drawn at some point in the histogram such that the shape to the left and the right of the vertical 
line are mirror images of each other. The mean, the median, and the mode are each 7 for these data. In a 
perfectly symmetrical distribution, the mean and the median are the same. This example has one mode 
(unimodal) and the mode is the same as the mean and median. In a symmetrical distribution that has two 
modes (bimodal), the two modes would be different from the mean and median. 

The histogram for the data: 

4;5;6;6;6;7;7;7;7;8 

is not symmetrical. The right-hand side seems "chopped off" compared to the left side. The shape of the
distribution is called skewed to the left because it is pulled out to the left.



11 This content is available online at <http://cnx.org/content/m17104/1.9/>.

The mean is 6.3, the median is 6.5, and the mode is 7. Notice that the mean is less than the median and
they are both less than the mode. The mean and the median both reflect the skewing but the mean more 
so. 

The histogram for the data: 

6;7;7;7;7;8;8;8;9;10 

is also not symmetrical. It is skewed to the right. 



[Figure: histogram of the data set above, skewed to the right.]



The mean is 7.7, the median is 7.5, and the mode is 7. Of the three statistics, the mean is the largest, while 
the mode is the smallest. Again, the mean reflects the skewing the most. 

To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, 
which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less 
than the median, which is less than the mean. 
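
The three data sets in this section can be checked with a short Python sketch. It is only an illustration using the standard statistics module; the dictionary labels are our own.

```python
from statistics import mean, median, mode

data_sets = {
    "symmetrical": [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left": [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

# For each data set, collect (mean, median, mode) so the ordering can be compared.
summary = {name: (mean(values), median(values), mode(values))
           for name, values in data_sets.items()}

for name, (avg, mid, most_common) in summary.items():
    print(f"{name}: mean={avg}, median={mid}, mode={most_common}")
```

The ordering of the three statistics matches the summary above for these data sets. Keep in mind that the text says "generally" and "often": the ordering is a tendency, not a rule that holds for every skewed data set.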



Skewness and symmetry become important when we discuss probability distributions in later chapters. 

2.9 Measuring the Spread of the Data (modified R. Bloom) 12



An important characteristic of any set of data is the variation in the data. In some data sets, the data values 
are concentrated closely near the mean; in other data sets, the data values are more widely spread out from 
the mean. The most common measure of variation, or spread, is the standard deviation. 

The standard deviation is a number that measures how far data values are from their mean. 
The standard deviation 

• provides a numerical measure of the overall amount of variation in a data set 

• can be used to determine whether a particular data value is close to or far from the mean 

First we will investigate what the standard deviation tells us about data; then we will learn to calculate the 
standard deviation. 

The standard deviation provides a measure of the overall variation in a data set 

The standard deviation is always positive or 0. The standard deviation is small when the data are all 
concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when 
the data values are more spread out from the mean, exhibiting more variation. 

Suppose that we are studying waiting times at the checkout line for customers at supermarket A and
supermarket B; the average wait time at both markets is 5 minutes. At market A, the standard deviation for
the waiting time is 2 minutes; at market B the standard deviation for the waiting time is 4 minutes. Because 
market B has a higher standard deviation, we know that there is more variation in the waiting times at 
market B. Overall, wait times at market B are more spread out from the average; wait times at market A are 
more concentrated near the average. 

The standard deviation can be used to determine whether a data value is close to or far from the mean. 

Suppose that Rosa and Binh both shop at Market A. Rosa waits for 7 minutes and Binh waits for 1 minute 
at the checkout counter. At market A, the mean wait time is 5 minutes and the standard deviation is 2 
minutes. 

Rosa waits for 7 minutes: 

• 7 is 2 minutes longer than the average of 5; 2 minutes is equal to one standard deviation. 

• Rosa's wait time of 7 minutes is 2 minutes longer than the average of 5 minutes. 

• Rosa's wait time of 7 minutes is one standard deviation above the average of 5 minutes. 

• A wait time that is only one standard deviation from the average is considered close to the average. 

Binh waits for 1 minute: 

• 1 is 4 minutes less than the average of 5; 4 minutes is equal to two standard deviations. 

• Binh's wait time of 1 minute is 4 minutes less than the average of 5 minutes. 

• Binh's wait time of 1 minute is two standard deviations below the average of 5 minutes. 

• A data value that is two standard deviations from the average is just on the borderline for what many 
statisticians would consider to be far from the average. Considering data to be far from the mean if it 
is more than 2 standard deviations away is more of an approximate "rule of thumb" than a rigid rule. 
In general, the shape of the distribution of the data affects how much of the data is further away than 
2 standard deviations. (We will learn more about this in later chapters.) 

• In general, a value = mean + (#ofSTDEVs) (standard deviation) 

• where #ofSTDEVs = the number of standard deviations 

• 7 is one standard deviation more than the mean of 5 because: 7 = 5 + (1)(2)



12 This content is available online at <http://cnx.org/content/m18923/1.2/>.
• 1 is two standard deviations less than the mean of 5 because: 1 = 5 + (-2)(2)
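
The relationship value = mean + (#ofSTDEVs)(standard deviation) can be rearranged to find #ofSTDEVs for any data value. The Python sketch below illustrates this for Rosa and Binh; the function name num_stdevs is our own.

```python
def num_stdevs(value, mean, stdev):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / stdev

mean_wait, stdev_wait = 5, 2            # market A, in minutes
rosa = num_stdevs(7, mean_wait, stdev_wait)
binh = num_stdevs(1, mean_wait, stdev_wait)
print(rosa, binh)

# Rearranged: value = mean + (#ofSTDEVs)(standard deviation)
assert mean_wait + rosa * stdev_wait == 7
assert mean_wait + binh * stdev_wait == 1
```

A positive result means the value is above the mean, and a negative result means it is below the mean.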
Calculating the Standard Deviation 

If x is a data value, then the difference "x - mean" is called its deviation. In a data set, there are as many
deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If
the data is for a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is
$x - \bar{x}$.

The procedure to calculate the standard deviation depends on whether the data is for the entire population
or comes from a sample. The calculations are similar, but not identical. Therefore the symbol used
to represent the standard deviation depends on whether it is a population or a sample. The lower case
letter s represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lower case) represents the
population standard deviation. If the sample has the same characteristics as the population, then s should
be a good estimate of $\sigma$.

To calculate the standard deviation, we need to calculate the variance first. The variance is an average of
the squares of the deviations (the $x - \bar{x}$ values for a sample, or the $x - \mu$ values for a population). The
symbol $\sigma^2$ represents the population variance; the population standard deviation $\sigma$ is the square root of
the population variance. The symbol $s^2$ represents the sample variance; the sample standard deviation s is
the square root of the sample variance. You can think of the standard deviation as a special average of the
deviations.

If the data is from a population, when we calculate the average of the squared deviations to find the
variance, we divide by N, the number of items in the population. If the data is from a sample rather than a
population, when we calculate the average of the squared deviations, we divide by n-1, one less than the
number of items in the sample. You can see that in the formulas below.

Formulas for Sample Standard Deviation

• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$ or $s = \sqrt{\frac{\sum f \cdot (x - \bar{x})^2}{n-1}}$

• For the sample standard deviation, the denominator is n-1, that is, the sample size MINUS 1.

Formulas for Population Standard Deviation

• $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f \cdot (x - \mu)^2}{N}}$

• For the population standard deviation, the denominator is N, the number of items in the population.

In these formulas, f represents the frequency with which a value appears. For example, if a value appears
once, f is 1. If a value appears three times in the data set, f is 3.

NOTE: In practice, USE A CALCULATOR OR COMPUTER SOFTWARE TO CALCULATE THE 
STANDARD DEVIATION. If you are using a TI-83, 83+, or 84+ calculator, you need to select the
appropriate standard deviation $\sigma$ or s from the summary statistics. We will concentrate on using
and interpreting the information that the standard deviation gives us. However you should study 
the following step-by-step example to help you understand how the standard deviation measures 
variation from the mean. 

Example 2.21 

At an elementary school, a teacher was interested in the average age and the standard deviation 
of the ages of the students in the fifth grade. The following data are the ages for a SAMPLE of 
n=20 fifth grade students. The ages are rounded to the nearest half year: 
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5

$\bar{x} = \frac{9 + 9.5 \times 2 + 10 \times 4 + 10.5 \times 4 + 11 \times 6 + 11.5 \times 3}{20} = 10.525$ (2.8)



The sample average age is 10.53 years, rounded to 2 places. 



The variance may be calculated by using a table. Then the standard deviation is calculated by 
taking the square root of the variance. We will explain the parts of the table after calculating s. 



Data    Freq.   Deviations                Deviations²               (Freq.)(Deviations²)
x       f       (x − x̄)                   (x − x̄)²                  (f)(x − x̄)²
9       1       9 − 10.525 = −1.525       (−1.525)² = 2.325625      1 × 2.325625 = 2.325625
9.5     2       9.5 − 10.525 = −1.025     (−1.025)² = 1.050625      2 × 1.050625 = 2.101250
10      4       10 − 10.525 = −0.525      (−0.525)² = 0.275625      4 × 0.275625 = 1.1025
10.5    4       10.5 − 10.525 = −0.025    (−0.025)² = 0.000625      4 × 0.000625 = 0.0025
11      6       11 − 10.525 = 0.475       (0.475)² = 0.225625       6 × 0.225625 = 1.35375
11.5    3       11.5 − 10.525 = 0.975     (0.975)² = 0.950625       3 × 0.950625 = 2.851875



Table 2.7 

The sample variance s² is equal to the sum of the last column (9.7375) divided by the total number
of data values minus one (20 − 1):

$s^2 = \frac{9.7375}{20 - 1} = 0.5125$

The sample standard deviation s is equal to the square root of the sample variance:

$s = \sqrt{0.5125} = 0.715891$. Rounded to two decimal places, s = 0.72.

Typically, you do the calculation for the standard deviation on a calculator or computer. Also 
note that we only rounded at the end of the calculation. The intermediate results are not rounded; 
this is done for accuracy. 
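The table calculation above can also be reproduced in software. The following is a minimal sketch in Python (the variable names are our own, not from the text) that follows the same steps as Table 2.7: multiply each squared deviation by its frequency, add the results, divide by n − 1, and take the square root. Rounding happens only when the results are printed.

```python
import math

# Frequency table from Example 2.21: age -> number of students with that age
freq = {9: 1, 9.5: 2, 10: 4, 10.5: 4, 11: 6, 11.5: 3}

n = sum(freq.values())                                   # sample size
mean = sum(x * f for x, f in freq.items()) / n           # sum of f*x divided by n
ss = sum(f * (x - mean) ** 2 for x, f in freq.items())   # total of the last column
variance = ss / (n - 1)                                  # sample variance s^2
s = math.sqrt(variance)                                  # sample standard deviation

print("n =", n)
print("mean =", round(mean, 3))
print("sum of f(x - mean)^2 =", round(ss, 4))
print("s^2 =", round(variance, 4))
print("s =", round(s, 6))
```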

Problem 1 

Verify the mean and standard deviation calculated above using a calculator or computer. 

Solution 

For the TI-83,83+,84+, enter data into the list editor.

Put the data values in list L1 and the frequencies in list L2.

STAT CALC 1-VarStats L1, L2

x̄ = 10.525

Use Sx because this is sample data (not a population): Sx = 0.715891



• For the following problems, recall that value = mean + (#ofSTDEVs)(standard deviation)

• For a sample: x = x̄ + (#ofSTDEVs)(s)

• For a population: x = μ + (#ofSTDEVs)(σ)

• For this example, use x = x̄ + (#ofSTDEVs)(s) because the data is from a sample
Problem 2

Find the value that is 1 standard deviation above the mean. Find (x̄ + 1s).

Solution

(x̄ + 1s) = 10.53 + (1)(0.72) = 11.25

Problem 3

Find the value that is two standard deviations below the mean. Find (x̄ − 2s).

Solution

(x̄ − 2s) = 10.53 − (2)(0.72) = 9.09

Problem 4

Find the values that are 1.5 standard deviations from (below and above) the mean.

Solution

(x̄ − 1.5s) = 10.53 − (1.5)(0.72) = 9.45
(x̄ + 1.5s) = 10.53 + (1.5)(0.72) = 11.61
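Problems 2 through 4 all apply the same formula, value = mean + (#ofSTDEVs)(standard deviation), with a different number of standard deviations each time. A minimal sketch in Python, using the rounded mean and standard deviation from this example (the function name `value_at` is our own):

```python
def value_at(mean, sd, num_stdevs):
    """value = mean + (#ofSTDEVs)(standard deviation)."""
    return mean + num_stdevs * sd


# Rounded sample mean and sample standard deviation from Example 2.21
x_bar, s = 10.53, 0.72

# A negative number of standard deviations gives a value below the mean.
for k in (1, -2, -1.5, 1.5):
    print(k, "standard deviations:", round(value_at(x_bar, s, k), 2))
```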



Explanation of the standard deviation calculation shown in the table 

The deviations show how spread out the data are about the mean. The data value 11.5 is farther from the
mean than is the data value 11. The deviations 0.975 and 0.475 indicate that. A positive deviation occurs when
the data value is greater than the mean. A negative deviation occurs when the data value is less than the
mean; the deviation is −1.525 for the data value 9. If you add up all the deviations, the sum is always zero;
the positive and negative deviations offset each other. (For this example, there are n = 20 deviations.) So you
cannot simply add the deviations to get the spread of the data. Squaring the deviations makes them
positive numbers, and the sum will also be positive. The variance, then, is the average squared deviation.
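The claim that the deviations always add to zero can be checked on the ages from Example 2.21. A minimal sketch in Python (variable names are our own); because computers store decimals approximately, the sum of the deviations may print as a tiny number rather than exactly zero, so it is rounded here.

```python
# The 20 ages from Example 2.21
ages = [9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
        11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5]

mean = sum(ages) / len(ages)
deviations = [x - mean for x in ages]      # positive and negative values
squared = [d ** 2 for d in deviations]     # never negative

print("number of deviations:", len(deviations))
print("sum of deviations:", round(sum(deviations), 10))
print("sum of squared deviations:", round(sum(squared), 4))
```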

The variance is a squared measure and does not have the same units as the data. Taking the square root 
solves the problem. The standard deviation measures the spread in the same units as the data. 

Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20 − 1 = 19 because the data is a
sample. For the sample variance, we divide by the sample size minus one (n − 1). Why not divide by n? The
answer has to do with the population variance. The sample variance is an estimate of the population
variance. Based on the theoretical mathematics that lies behind these calculations, dividing by (n − 1) gives a
better estimate of the population variance.

NOTE: Your concentration should be on what the standard deviation tells us about the data. The 
standard deviation is a number which measures how far the data are spread from the mean. Let a 
calculator or computer do the arithmetic. 

The standard deviation, s or σ, is either zero or larger than zero. When the standard deviation is 0, there is
no spread; that is, all the data values are equal to each other. The standard deviation is small when the
data are all concentrated close to the mean, and is larger when the data values show more variation from
the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about
the mean; outliers can make s or σ very large.

The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a 
better "feel" for the deviations and the standard deviation. You will find that in symmetrical distributions, 
the standard deviation can be very helpful but in skewed distributions, the standard deviation may not be 
much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed
distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and 
the largest value. Because numbers can be confusing, always graph your data. 

Example 2.22 

Use the following data (first exam scores) from Susan Dean's spring pre-calculus class: 

33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94; 94; 94; 94; 
96; 100 

a. Create a chart containing the data, frequencies, relative frequencies, and cumulative relative
frequencies to three decimal places.

b. Calculate the following to one decimal place using a TI-83+ or TI-84 calculator:

   i. The sample mean
   ii. The sample standard deviation
   iii. The median
   iv. The first quartile
   v. The third quartile
   vi. IQR

c. Construct a box plot and a histogram on the same set of axes. Make comments about the box
plot, the histogram, and the chart.

Solution 
a.

Data    Frequency    Relative Frequency    Cumulative Relative Frequency
33      1            0.032                 0.032
42      1            0.032                 0.064
49      2            0.065                 0.129
53      1            0.032                 0.161
55      2            0.065                 0.226
61      1            0.032                 0.258
63      1            0.032                 0.290
67      1            0.032                 0.322
68      2            0.065                 0.387
69      2            0.065                 0.452
72      1            0.032                 0.484
73      1            0.032                 0.516
74      1            0.032                 0.548
78      1            0.032                 0.580
80      1            0.032                 0.612
83      1            0.032                 0.644
88      3            0.097                 0.741
90      1            0.032                 0.773
92      1            0.032                 0.805
94      4            0.129                 0.934
96      1            0.032                 0.966
100     1            0.032                 0.998 (Why isn't this value 1?)



Table 2.8 
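One way to explore the question asked in the last row of the table is to rebuild the chart in software. The following is a minimal sketch in Python (variable names are our own). It assumes, as the table appears to do, that each relative frequency is rounded to three decimal places before the cumulative column is added up, and it compares that total with the total of the unrounded relative frequencies.

```python
from collections import Counter

# First exam scores from Example 2.22
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

n = len(scores)
counts = Counter(scores)

cumulative = 0.0
for value in sorted(counts):
    rel = round(counts[value] / n, 3)   # relative frequency, rounded first
    cumulative += rel                   # running total of the ROUNDED values
    print(value, counts[value], rel, round(cumulative, 3))

# Total of the unrounded relative frequencies, for comparison
exact_total = sum(counts.values()) / n
print("rounded total:", round(cumulative, 3), " exact total:", exact_total)
```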



b.

   i. The sample mean = 73.5
   ii. The sample standard deviation = 17.9
   iii. The median = 73
   iv. The first quartile = 61
   v. The third quartile = 90
   vi. IQR = 90 − 61 = 29
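These calculator results can be cross-checked in software. The following is a minimal sketch, assuming Python 3.8 or later and its standard `statistics` module. Different tools use different conventions for quartiles, so agreement is not guaranteed in general; for this data set the module's default ("exclusive") method gives the same quartiles as the values listed above.

```python
import statistics

# First exam scores from Example 2.22
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

mean = statistics.mean(scores)
s = statistics.stdev(scores)                         # sample standard deviation (n - 1)
q1, median, q3 = statistics.quantiles(scores, n=4)   # three cut points: Q1, median, Q3
iqr = q3 - q1

print("mean =", round(mean, 1))
print("s =", round(s, 1))
print("median =", median, " Q1 =", q1, " Q3 =", q3, " IQR =", iqr)
```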
c. The x-axis goes from 32.5 to 100.5; the y-axis goes from −2.4 to 15 for the histogram. The number of
intervals is 5 for the histogram, so the width of an interval is (100.5 − 32.5) divided by 5, which
is equal to 13.6. Endpoints of the intervals: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 =
59.7, 59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending value. No data values
fall on an interval boundary.
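The interval width and endpoints described above can be generated, and the bar heights counted, with a few lines of code. The following is a minimal sketch in Python (variable names are our own). Since no data value falls on a boundary, it does not matter here whether the endpoints are counted as inside or outside an interval.

```python
# First exam scores from Example 2.22
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

low, high, bins = 32.5, 100.5, 5
width = (high - low) / bins

# Endpoints: start at the low value and add the width repeatedly.
edges = [round(low + i * width, 1) for i in range(bins + 1)]

# Number of scores strictly between each pair of neighboring endpoints
counts = [sum(1 for x in scores if lo < x < hi)
          for lo, hi in zip(edges, edges[1:])]

print("width =", width)
print("endpoints =", edges)
print("frequencies =", counts)
```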



[Figure: histogram of the exam scores with a box plot on the same axes; x-axis interval endpoints 32.5, 46.1, 59.7, 73.3, 86.9, 100.5]





Figure 2.1 



The long left whisker in the box plot is reflected in the left side of the histogram. The spread of the 
exam scores in the lower 50% is greater (73-33=40) than the spread in the upper 50% (100-73=27). 
The histogram, box plot, and chart all reflect this. There are a substantial number of A and B grades 
(80s, 90s, and 100). The histogram clearly shows this. The box plot shows us that the middle 50% 
of the exam scores (IQR=29) are Ds, Cs, and Bs. The box plot also shows us that the lower 25% of 
the exam scores are Ds and Fs. 



Comparing Values from Different Data Sets 

The standard deviation is useful when comparing data values that come from different data sets. If the data 
sets have different means and standard deviations, it can be misleading to compare the data values directly. 

• For each data value, calculate how many standard deviations the value is away from its mean. 

• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for #ofSTDEVs. 

• $\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

• Compare the results of this calculation. 

#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the formulas become: 



Sample        $x = \bar{x} + zs$         $z = \frac{x - \bar{x}}{s}$
Population    $x = \mu + z\sigma$        $z = \frac{x - \mu}{\sigma}$



Table 2.9 
Example 2.23

Two students, John and Ali, from different high schools, wanted to find out who had the highest 
grade point average (GPA) when compared to his school. Which student had the highest GPA 
when compared to his school? 



Student    GPA     School Mean GPA    School Standard Deviation
John       2.85    3.0                0.7
Ali        77      80                 10



Table 2.10 

Solution 

For each student, determine how many standard deviations (#ofSTDEVs) his GPA is away from 
the average, for his school. Pay careful attention to signs when comparing and interpreting the 
answer. 



$\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$; $z = \frac{x - \mu}{\sigma}$

For John, $z = \#\text{ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21$

For Ali, $z = \#\text{ofSTDEVs} = \frac{77 - 80}{10} = -0.3$



John has the better GPA when compared to his school because his GPA is 0.21 standard deviations 
below his mean while Ali's GPA is 0.3 standard deviations below his mean. 

John's z-score of −0.21 is higher than Ali's z-score of −0.3. For GPA, higher values are better, so
we conclude that John has the better GPA when compared to his school.
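The comparison in this example can be written as a short computation. A minimal sketch in Python, using the values from Table 2.10 (the function name `z_score` is our own):

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from its mean."""
    return (value - mean) / sd


john = z_score(2.85, 3.0, 0.7)   # John's GPA on his school's scale
ali = z_score(77, 80, 10)        # Ali's GPA on his school's scale

print("John:", round(john, 2))
print("Ali: ", round(ali, 2))
print("John has the higher z-score:", john > ali)
```

Both z-scores are negative because both students are below their school means; the one closer to zero is the higher of the two.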



The following lists give a few facts that provide a little more insight into what the standard deviation tells 
us about the distribution of the data. 

For ANY data set, no matter what shape the distribution of the data is: 

• At least 75% of the data is within 2 standard deviations of the mean. 

• At least 89% of the data is within 3 standard deviations of the mean. 

• At least 95% of the data is within 4 1/2 standard deviations of the mean. 

• This is known as Chebyshev's Rule. 

For data having a distribution that is MOUND-SHAPED and SYMMETRIC: 

• Approximately 68% of the data is within 1 standard deviation of the mean. 

• Approximately 95% of the data is within 2 standard deviations of the mean. 

• More than 99% of the data is within 3 standard deviations of the mean.

• This is known as the Empirical Rule. 

• It is important to note that this rule only applies when the shape of the distribution of the data is
mound-shaped and symmetric. We will learn more about this when studying the "Normal" or
"Gaussian" probability distribution in later chapters.
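The percentages listed under Chebyshev's Rule come from a single formula that the text does not state: for any k greater than 1, at least 1 − 1/k² of the data lies within k standard deviations of the mean. The following minimal sketch in Python (the function name is our own) evaluates that bound for the three cases listed; the percentages in the list above are these results rounded to the nearest percent.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of ANY data set within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4.5):
    print(k, "standard deviations: at least",
          round(100 * chebyshev_lower_bound(k), 1), "%")
```

The Empirical Rule percentages (68%, 95%, more than 99%) do not come from this formula; they apply only to mound-shaped, symmetric data.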
2.10 Summary of Formulas 13

Commonly Used Symbols 

• The symbol Σ means to add or to find the sum.

• n = the number of data values in a sample

• N = the number of people, things, etc. in the population

• x̄ = the sample mean

• s = the sample standard deviation

• μ = the population mean

• σ = the population standard deviation

• f = frequency

• x = numerical value

Commonly Used Expressions 



x * f = A value multiplied by its respective frequency

Σx = The sum of the values

Σ x * f = The sum of values multiplied by their respective frequencies

(x − x̄) or (x − μ) = Deviations from the mean (how far a value is from the mean)

(x − x̄)² or (x − μ)² = Deviations squared

f(x − x̄)² or f(x − μ)² = The deviations squared and multiplied by their frequencies



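Mean Formulas:

• $\bar{x} = \frac{\sum x}{n}$ or $\bar{x} = \frac{\sum f \cdot x}{n}$

• $\mu = \frac{\sum x}{N}$ or $\mu = \frac{\sum f \cdot x}{N}$

Standard Deviation Formulas:

• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f \cdot (x - \bar{x})^2}{n - 1}}$

• $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f \cdot (x - \mu)^2}{N}}$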

Formulas Relating a Value, the Mean, and the Standard Deviation: 

• value = mean + (#ofSTDEVs)(standard deviation), where #ofSTDEVs = the number of standard
deviations

• x = x̄ + (#ofSTDEVs)(s)

• x = μ + (#ofSTDEVs)(σ)



13 This content is available online at <http://cnx.org/content/ml6310/1.9/>.
2.11 Practice 1: Center of the Data
2.11.1 Student Learning Outcomes 

• The student will calculate and interpret the center, spread, and location of the data. 

• The student will construct and interpret histograms and box plots.



2.11.2 Given 

Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one 
week. Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve 
generally sell five cars; nine generally sell six cars; eleven generally sell seven cars. 

2.11.3 Complete the Table 



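Data Value (# cars)    Frequency     Relative Frequency    Cumulative Relative Frequency
__________             __________    __________            __________
__________             __________    __________            __________
__________             __________    __________            __________
__________             __________    __________            __________
__________             __________    __________            __________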

Table 2.11 



2.11.4 Discussion Questions

Exercise 2.11.1 (Solution on p. 81.)

What does the frequency column sum to? Why?

Exercise 2.11.2 (Solution on p. 81.)

What does the relative frequency column sum to? Why?

Exercise 2.11.3 

What is the difference between relative frequency and frequency for each data value? 

Exercise 2.11.4 

What is the difference between cumulative relative frequency and relative frequency for each data 
value? 



2.11.5 Enter the Data 

Enter your data into your calculator or computer. 

14 This content is available online at <http://cnx.Org/content/ml6312/l.12/>. 
2.11.6 Construct a Histogram

Determine appropriate minimum and maximum x and y values and the scaling. Sketch the histogram 
below. Label the horizontal and vertical axes with words. Include numerical scaling. 



2.11.7 Data Statistics 

Calculate the following values: 

Exercise 2.11.5 (Solution on p. 81.)

Sample mean = x̄ =

Exercise 2.11.6 (Solution on p. 81.)

Sample standard deviation = s_x =

Exercise 2.11.7 (Solution on p. 81.)

Sample size = n =



2.11.8 Calculations 

Use the table in section 2.11.3 to calculate the following values: 

Exercise 2.11.8 (Solution on p. 81.)

Median =

Exercise 2.11.9 (Solution on p. 81.)

Mode =

Exercise 2.11.10 (Solution on p. 81.)

First quartile =

Exercise 2.11.11 (Solution on p. 82.)

Second quartile = median = 50th percentile =

Exercise 2.11.12 (Solution on p. 82.)

Third quartile =

Exercise 2.11.13 (Solution on p. 82.)

Interquartile range (IQR) = _____ − _____ = _____

Exercise 2.11.14 (Solution on p. 82.)

10th percentile =

Exercise 2.11.15 (Solution on p. 82.)

70th percentile =

Exercise 2.11.16 (Solution on p. 82.)

Find the value that is 3 standard deviations: 

a. Above the mean 

b. Below the mean 



2.11.9 Box Plot 

Construct a box plot below. Use a ruler to measure and scale accurately. 

2.11.10 Interpretation 

Looking at your box plot, does it appear that the data are concentrated together, spread out evenly, or 
concentrated in some areas, but not in others? How can you tell? 
2.12 Practice 2: Spread of the Data 15

2.12.1 Student Learning Outcomes 

• The student will calculate measures of the center of the data. 

• The student will calculate the spread of the data. 

2.12.2 Given 

The population parameters below describe the full-time equivalent number of students (FTES) each year 
at Lake Tahoe Community College from 1976-77 through 2004-2005. (Source: Graphically Speaking by Bill 
King, LTCC Institutional Research, December 2005). 

Use these values to answer the following questions: 

• μ = 1000 FTES

• Median = 1014 FTES

• σ = 474 FTES

• First quartile = 528.5 FTES

• Third quartile = 1447.5 FTES

• n = 29 years



2.12.3 Calculate the Values 

Exercise 2.12.1 (Solution on p. 82.)

A sample of 11 years is taken. About how many are expected to have a FTES of 1014 or above?
Explain how you determined your answer.

Exercise 2.12.2 (Solution on p. 82.)

75% of all years have a FTES:

a. At or below:

b. At or above:

Exercise 2.12.3 (Solution on p. 82.)

The population standard deviation =

Exercise 2.12.4 (Solution on p. 82.)

What percent of the FTES were from 528.5 to 1447.5? How do you know?

Exercise 2.12.5 (Solution on p. 82.)

What is the IQR? What does the IQR represent?

Exercise 2.12.6 (Solution on p. 82.)

How many standard deviations away from the mean is the median?



15 This content is available online at <http://cnx.Org/content/ml7105/l.ll/>. 
2.13 Practice 3: Interpreting Percentiles 16

Interpreting Percentiles, Quartiles, and Median 

A percentile indicates the relative standing of a data value when data are sorted into numerical order, from 
smallest to largest. p% of data values are less than or equal to the pth percentile. 

• Low percentiles always correspond to lower data values. 

• High percentiles always correspond to higher data values. 
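The statement that p% of data values are less than or equal to the pth percentile can be turned around: given a data value, count what percent of the data is at or below it. The following is a minimal sketch in Python; the function name and the list of finishing times are hypothetical and serve only as an illustration.

```python
def percent_at_or_below(data, value):
    """Percent of data values that are less than or equal to the given value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)


# Hypothetical finishing times in minutes, for illustration only
times = [30, 33, 35, 36, 38, 40, 41, 44, 47, 50, 52, 55]

print(percent_at_or_below(times, 35), "% of times are 35 minutes or less")
```

In this made-up list, 3 of the 12 times are 35 minutes or less, so 35 minutes sits at the 25th percentile, which is the first quartile.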

A percentile may or may not correspond to a value judgment about whether it is "good" or "bad". The
interpretation of whether a certain percentile is good or bad depends on the context of the situation to
which the data applies. In some situations, a low percentile would be considered "good"; in other contexts
a high percentile might be considered "good". In many situations, there is no value judgment that applies.

Understanding how to properly interpret percentiles is important not only when describing data, but is 
also important in later chapters of this textbook when calculating probabilities. 

Guideline: 

When writing the interpretation of a percentile in the context of the given data, the sentence should contain 
the following information: 

• information about the context of the situation being considered, 

• the data value (value of the variable) that represents the percentile, 

• the percent of individuals or items with data values below the percentile. 

• Additionally, you may also choose to state the percent of individuals or items with data values above 
the percentile. 

Example 2.24 

On a timed math test, the first quartile for times for finishing the exam was 35 minutes. Interpret 
the first quartile in the context of this situation. 

• 25% of students finished the exam in 35 minutes or less. 

• 75% of students finished the exam in 35 minutes or more. 

• A low percentile would be considered good, as finishing more quickly on a timed exam is 
desirable. (If you take too long, you might not be able to finish.) 

Example 2.25 

On a 20 question math test, the 70th percentile for number of correct answers was 16. Interpret 
the 70th percentile in the context of this situation. 

• 70% of students answered 16 or fewer questions correctly. 

• 30% of students answered 16 or more questions correctly. 

• Note: A high percentile would be considered good, as answering more questions correctly 
is desirable. 

Example 2.26 

At a certain community college, it was found that the 30th percentile of credit units that students 
are enrolled for is 7 units. Interpret the 30th percentile in the context of this situation. 

• 30% of students are enrolled in 7 or fewer credit units 

• 70% of students are enrolled in 7 or more credit units 



16 This content is available online at <http://cnx.org/content/ml8845/!. l/>.
• In this example, there is no "good" or "bad" value judgment associated with a higher or
lower percentile. Students attend community college for varied reasons and needs, and their 
course load varies according to their needs. 

Do the following Practice Problems for Interpreting Percentiles 

Exercise 2.13.1 (Solution on p. 82.) 

a. For runners in a race, a low time means a faster run. The winners in a race have the
shortest running times. Is it more desirable to have a finish time with a high or a low
percentile when running a race?

b. The 20th percentile of run times in a particular race is 5.2 minutes. Write a sentence
interpreting the 20th percentile in the context of the situation.

c. A bicyclist in the 90th percentile of a bicycle race between two towns completed the race
in 1 hour and 12 minutes. Is he among the fastest or slowest cyclists in the race? Write
a sentence interpreting the 90th percentile in the context of the situation.

Exercise 2.13.2 (Solution on p. 82.) 

a. For runners in a race, a higher speed means a faster run. Is it more desirable to have a
speed with a high or a low percentile when running a race?

b. The 40th percentile of speeds in a particular race is 7.5 miles per hour. Write a sentence
interpreting the 40th percentile in the context of the situation.

Exercise 2.13.3 (Solution on p. 83.) 

On an exam, would it be more desirable to earn a grade with a high or low percentile? Explain. 

Exercise 2.13.4 (Solution on p. 83.) 

Mina is waiting in line at the Department of Motor Vehicles (DMV). Her wait time of 32 minutes 
is the 85th percentile of wait times. Is that good or bad? Write a sentence interpreting the 85th 
percentile in the context of this situation. 

Exercise 2.13.5 (Solution on p. 83.) 

In a survey collecting data about the salaries earned by recent college graduates, Li found that her 
salary was in the 78th percentile. Should Li be pleased or upset by this result? Explain. 

Exercise 2.13.6 (Solution on p. 83.) 

In a study collecting data about the repair costs of damage to automobiles in a certain type of
crash test, a certain model of car had $1700 in damage and was in the 90th percentile. Should the
manufacturer and/or a consumer be pleased or upset by this result? Explain. Write a sentence
that interprets the 90th percentile in the context of this problem.

Exercise 2.13.7 (Solution on p. 83.) 

The University of California has two criteria used to set admission standards for freshmen
to be admitted to a college in the UC system:

a. Students' GPAs and scores on standardized tests (SATs and ACTs) are entered into a
formula that calculates an "admissions index" score. The admissions index score is
used to set eligibility standards intended to meet the goal of admitting the top 12%
of high school students in the state. In this context, what percentile does the top 12%
represent?

b. Students whose GPAs are at or above the 96th percentile of all students at their high
school are eligible (called eligible in the local context), even if they are not in the top
12% of all students in the state. What percent of students from each high school are
"eligible in the local context"?
Exercise 2.13.8 (Solution on p. 83.)

Suppose that you are buying a house. You and your realtor have determined that the most
expensive house you can afford is the 34th percentile. The 34th percentile of housing prices is $240,000
in the town you want to move to. In this town, can you afford 34% of the houses or 66% of the
houses?
2.14 Homework Link 17

Link to homework questions in Homework Collection/Book for Descriptive Statistics Chapter 2 of
Collaborative Statistics for R. Bloom

Chapter 2 Homework Problems ( http://cnx.org/content/ml8645/latest/ ) 18 



17 This content is available online at <http://cnx.Org/content/ml9044/l.l/>. 
18 http://cnx.org/content/ml8645/latest/ 
Solutions to Exercises in Chapter 2

Solution to Example 2.2, Problem (p. 43) 

The value 12.3 may be an outlier. Values appear to concentrate at 3 and 4 kilometers. 



Solution to Example 2.7, Problem (p. 48) 

• 3.5 to 4.5 

• 4.5 to 5.5 

• 6 

• 5.5 to 6.5 

Solution to Example 2.9, Problem (p. 52) 
First Data Set 

• Xmin = 32 

• Q1 = 56

• M = 74.5 

• Q3 = 82.5 

• Xmax = 99 



Stem    Leaf
1       1 5
2       3 5 7
3       2 3 3 5 8
4       0 2 5 5 7 8
5       5 6
6       5 7
7
8
9
10
11
12      3



Table 2.12 



Second Data Set 

• Xmin = 25.5 

• Q1 = 78

• M = 81 

• Q3 = 89 

• Xmax = 98 
[Figure: plot with a horizontal axis scaled from 20 to 100 in steps of 10]



Solution to Example 2.11, Problem (p. 53) 

For the IQRs, see the answer to the test scores example (Solution to Example 2.9: p. 79). The first data set
has the larger IQR, so the scores between Q3 and Q1 (middle 50%) for the first data set are more spread out
and not clustered about the median.

First Data Set 



• (3/2) · (IQR) = (3/2) · (26.5) = 39.75

• Xmax − Q3 = 99 − 82.5 = 16.5

• Q1 − Xmin = 56 − 32 = 24

(3/2) · (IQR) = 39.75 is larger than 16.5 and larger than 24, so the first set has no outliers.
Second Data Set 



• (3/2) · (IQR) = (3/2) · (11) = 16.5

• Xmax − Q3 = 98 − 89 = 9

• Q1 − Xmin = 78 − 25.5 = 52.5

(3/2) · (IQR) = 16.5 is larger than 9 but smaller than 52.5, so for the second set 45 and 25.5 are outliers.
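The outlier check used for both data sets follows the same steps: compute 1.5 times the IQR, then compare it with the distance from each quartile to the nearest extreme value. The following is a minimal sketch in Python (the function name and dictionary keys are our own), using the five-number summaries given in the solution to Example 2.9. It only reports whether the smallest or largest value is an outlier; identifying every outlier would require the full data set.

```python
def outlier_check(x_min, q1, q3, x_max):
    """Compare 1.5 * IQR with the distances from the quartiles to the extremes."""
    iqr = q3 - q1
    reach = 1.5 * iqr
    return {
        "iqr": iqr,
        "reach": reach,                          # (3/2) * IQR
        "low_outlier": (q1 - x_min) > reach,     # smallest value too far below Q1
        "high_outlier": (x_max - q3) > reach,    # largest value too far above Q3
    }


first = outlier_check(32, 56, 82.5, 99)
second = outlier_check(25.5, 78, 89, 98)

print("first data set: ", first)
print("second data set:", second)
```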

To find the percentiles, create a frequency, relative frequency, and cumulative relative frequency chart (see 
"Frequency" from the Sampling and Data Chapter (Section 1.9)). Get the percentiles from that chart. 

First Data Set 

• 30th %ile (between the 6th and 7th values) = (56 + 59)/2 = 57.5

• 80th %ile (between the 16th and 17th values) = (84 + 84.5)/2 = 84.25

Second Data Set 

• 30th %ile (7th value) = 78 

• 80th %ile (18th value) = 90 

30% of the data falls below the 30th %ile, and 20% falls above the 80th %ile. 
Solution to Example 2.13, Problem (p. 54) 

1. (8 + 9)/2 = 8.5

2. 9 

3. 6 

4. First Quartile = 25th %ile 

Solution to Exercise 2.6.1 (p. 56) 

a. For runners in a race it is more desirable to have a low percentile for finish time. A low percentile means
a short time, which is faster.

b. INTERPRETATION: 20% of runners finished the race in 5.2 minutes or less. 80% of runners finished the
race in 5.2 minutes or longer.

c. He is among the slowest cyclists (90% of cyclists were faster than him). INTERPRETATION: 90% of
cyclists had a finish time of 1 hour, 12 minutes or less. Only 10% of cyclists had a finish time of 1 hour,
12 minutes or longer.

Solution to Exercise 2.6.2 (p. 56) 

a. For runners in a race it is more desirable to have a high percentile for speed. A high percentile means a
higher speed, which is faster.

b. INTERPRETATION: 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 60% of runners
ran at speeds of 7.5 miles per hour or more (faster).

Solution to Exercise 2.6.3 (p. 56) 

On an exam you would prefer a high percentile; higher percentiles correspond to higher grades on the 
exam. 
Solution to Exercise 2.6.4 (p. 56) 

When waiting in line at the DMV, the 85th percentile would be a long wait time compared to the other 
people waiting. 85% of people had shorter wait times than you did. In this context, you would prefer a 
wait time corresponding to a lower percentile. INTERPRETATION: 85% of people at the DMV waited 32 
minutes or less. 15% of people at the DMV waited 32 minutes or longer. 
Solution to Exercise 2.6.5 (p. 56) 

Li should be pleased. Her salary is relatively high compared to other recent college grads. 78% of recent 
college graduates earn less than Li does. 22% of recent college graduates earn more than Li does. 
Solution to Exercise 2.6.6 (p. 56) 

The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared 
to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs 
of $1700 or less; only 10% had damage repair costs of $1700 or more. 
Solution to Exercise 2.6.7 (p. 56) 

a. The top 12% of students are those who are at or above the 88th percentile of admissions index scores.

b. The top 4% of students' GPAs are at or above the 96th percentile, making the top 4% of students "eligible
in the local context".

Solution to Exercise 2.6.8 (p. 57) 

You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION: 
34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more. 

Solutions to Practice 1: Center of the Data 

Solution to Exercise 2.11.1 (p. 71) 

65 
Solution to Exercise 2.11.2 (p. 71) 

1 
Solution to Exercise 2.11.5 (p. 72) 

4.75 

Solution to Exercise 2.11.6 (p. 72) 

1.39 
Solution to Exercise 2.11.7 (p. 72) 

65 
Solution to Exercise 2.11.8 (p. 72) 

4 

Solution to Exercise 2.11.9 (p. 72) 

4 
Solution to Exercise 2.11.10 (p. 72)

4 

Solution to Exercise 2.11.11 (p. 72) 

4 

Solution to Exercise 2.11.12 (p. 72) 

6 

Solution to Exercise 2.11.13 (p. 72) 

6-4 = 2 

Solution to Exercise 2.11.14 (p. 72) 

3 

Solution to Exercise 2.11.15 (p. 72) 

6 

Solution to Exercise 2.11.16 (p. 73) 

a. 8.93 

b. 0.58 

Solutions to Practice 2: Spread of the Data 

Solution to Exercise 2.12.1 (p. 74) 

6 

Solution to Exercise 2.12.2 (p. 74) 

a. 1447.5 

b. 528.5 

Solution to Exercise 2.12.3 (p. 74) 

474 FTES 

Solution to Exercise 2.12.4 (p. 74) 

50% 

Solution to Exercise 2.12.5 (p. 74) 

919 

Solution to Exercise 2.12.6 (p. 74) 

0.03 



Solutions to Practice 3: Interpreting Percentiles 

Solution to Exercise 2.13.1 (p. 76) 

a. For runners in a race it is more desirable to have a low percentile for finish time. A low percentile
means a short time, which is faster.

b. INTERPRETATION: 20% of runners finished the race in 5.2 minutes or less. 80% of runners finished
the race in 5.2 minutes or longer.

c. He is among the slowest cyclists (90% of cyclists were faster than him). INTERPRETATION: 90%
of cyclists had a finish time of 1 hour, 12 minutes or less. Only 10% of cyclists had a finish time
of 1 hour, 12 minutes or longer.

Solution to Exercise 2.13.2 (p. 76) 

a. For runners in a race it is more desirable to have a high percentile for speed. A high percentile
means a higher speed, which is faster.

b. INTERPRETATION: 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 60% of
runners ran at speeds of 7.5 miles per hour or more (faster).
Solution to Exercise 2.13.3 (p. 76)

On an exam you would prefer a high percentile; higher percentiles correspond to higher grades on the 
exam. 
Solution to Exercise 2.13.4 (p. 76) 

When waiting in line at the DMV, the 85th percentile would be a long wait time compared to the other 
people waiting. 85% of people had shorter wait times than you did. In this context, you would prefer a 
wait time corresponding to a lower percentile. INTERPRETATION: 85% of people at the DMV waited 32 
minutes or less. 15% of people at the DMV waited 32 minutes or longer. 
Solution to Exercise 2.13.5 (p. 76) 

Li should be pleased. Her salary is relatively high compared to other recent college grads. 78% of recent 
college graduates earn less than Li does. 22% of recent college graduates earn more than Li does. 
Solution to Exercise 2.13.6 (p. 76) 

The manufacturer and the consumer would be upset. This is a large repair cost for the damages, compared 
to the other cars in the sample. INTERPRETATION: 90% of the crash tested cars had damage repair costs 
of $1700 or less; only 10% had damage repair costs of $1700 or more. 
Solution to Exercise 2.13.7 (p. 76) 

a. The top 12% of students are those who are at or above the 88th percentile of admissions index 

scores. 

b. The top 4% of students' GPAs are at or above the 96th percentile, making the top 4% of students 

"eligible in the local context". 

Solution to Exercise 2.13.8 (p. 77) 

You can afford 34% of houses. 66% of the houses are too expensive for your budget. INTERPRETATION: 

34% of houses cost $240,000 or less. 66% of houses cost $240,000 or more. 

Chapter 3

Probability Topics 

3.1 Probability Topics 1 

3.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Understand and use the terminology of probability. 

• Determine whether two events are mutually exclusive and whether two events are independent. 

• Calculate probabilities using the Addition Rules and Multiplication Rules. 

• Construct and interpret Contingency Tables. 

• Construct and interpret Venn Diagrams (optional). 

• Construct and interpret Tree Diagrams (optional). 



3.1.2 Introduction 

It is often necessary to "guess" about the outcome of an event in order to make a decision. Politicians study 
polls to guess their likelihood of winning an election. Teachers choose a particular course of study based 
on what they think students can comprehend. Doctors choose the treatments needed for various diseases 
based on their assessment of likely results. You may have visited a casino where people play games chosen 
because of the belief that the likelihood of winning is good. You may have chosen your course of study 
based on the probable availability of jobs. 

You have, more than likely, used probability. In fact, you probably have an intuitive sense of probability. 
Probability deals with the chance of an event occurring. Whenever you weigh the odds of whether or not 
to do your homework or to study for an exam, you are using probability. In this chapter, you will learn to 
solve probability problems using a systematic approach. 

3.1.3 Optional Collaborative Classroom Exercise 

Your instructor will survey your class. Count the number of students in the class today. 

• Raise your hand if you have any change in your pocket or purse. Record the number of raised hands. 

• Raise your hand if you rode a bus within the past month. Record the number of raised hands. 

• Raise your hand if you answered "yes" to BOTH of the first two questions. Record the number of 
raised hands. 



1 This content is available online at <http://cnx.org/content/m16838/1.11/>.
Use the class data as estimates of the following probabilities. P(change) means the probability that a randomly
chosen person in your class has change in his/her pocket or purse. P(bus) means the probability that
a randomly chosen person in your class rode a bus within the last month, and so on. Discuss your answers.

• Find P(change). 

• Find P(bus). 

• Find P(change and bus). Find the probability that a randomly chosen student in your class has change
in his/her pocket or purse and rode a bus within the last month.

• Find P(change | bus). Find the probability that a randomly chosen student has change given that
he/she rode a bus within the last month. Count all the students who rode a bus. From the group
of students who rode a bus, count those who have change. The probability is equal to the number who have
change and rode a bus divided by the number who rode a bus.
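
The four estimates above can be sketched in a few lines of Python. The counts below are hypothetical, invented purely for illustration; replace them with the numbers recorded in your own class.

```python
from fractions import Fraction

# Hypothetical survey counts (illustrative only; use your own class data).
class_size = 30
has_change = 18
rode_bus = 12
change_and_bus = 8

p_change = Fraction(has_change, class_size)
p_bus = Fraction(rode_bus, class_size)
p_change_and_bus = Fraction(change_and_bus, class_size)

# Conditional probability: restrict attention to the students who rode a bus.
p_change_given_bus = Fraction(change_and_bus, rode_bus)

print(p_change, p_bus, p_change_and_bus, p_change_given_bus)
```

Note that the conditional estimate divides by the number of bus riders, not by the class size.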



3.2 Terminology (modified R. Bloom) 2 

Probability measures the uncertainty that is associated with the outcomes of a particular experiment or 
activity. An experiment is a planned operation carried out under controlled conditions. If the result is not 
predetermined, then the experiment is said to be a chance experiment. Flipping one fair coin is an example 
of an experiment. 

The result of an experiment is called an outcome. A sample space is a set of all possible outcomes. Three 
ways to represent a sample space are to list the possible outcomes, to create a tree diagram, or to create a 
Venn diagram. The uppercase letter S is used to denote the sample space. For example, if you flip one fair 
coin, S = {H, T} where H = heads and T = tails are the outcomes. 

An event is any combination of outcomes. Upper case letters like A and B represent events. For example, 
if the experiment is to flip one fair coin, event A might be getting at most one head. The probability of an 
event A is written P (A). 

The probability of any outcome is the long-term relative frequency of that outcome. Probabilities are
between 0 and 1, inclusive (includes 0 and 1 and all numbers between these values). P(A) = 0 means
the event A can never happen. P(A) = 1 means the event A always happens. P(A) = 0.5 means the
event A is equally likely to occur or not to occur. For example, if you flip one fair coin repeatedly (from 20
to 2,000 to 20,000 times) the relative frequency of heads approaches 0.5 (the probability of heads).

Equally likely means that each outcome of an experiment occurs with equal probability. For example, if 
you toss a fair, six-sided die, each face (1, 2, 3, 4, 5, or 6) is as likely to occur as any other face. If you 
toss a fair coin, a Head(H) and a Tail(T) are equally likely to occur. If you randomly guess the answer to a 
true/false question on an exam, you are equally likely to select a correct answer or an incorrect answer.

To calculate the probability of an event A when all outcomes in the sample space are equally likely,
count the number of outcomes for event A and divide by the total number of outcomes in the sample space.
For example, if you toss a fair dime and a fair nickel, the sample space is {HH, TH, HT, TT} where T =
tails and H = heads. The sample space has four outcomes. A = getting exactly one head. There are two outcomes
{HT, TH}. P(A) = 2/4.
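
The counting method for equally likely outcomes can be sketched in Python. This is a minimal illustration, not part of the original text; the variable names are arbitrary, and the first letter of each outcome stands for the dime and the second for the nickel.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a dime and a nickel: each coin shows H or T.
sample_space = ["".join(pair) for pair in product("HT", repeat=2)]

# Event A: exactly one head.
event_a = [outcome for outcome in sample_space if outcome.count("H") == 1]

# Equally likely outcomes: count the outcomes in A, divide by the total count.
p_a = Fraction(len(event_a), len(sample_space))
print(sample_space, event_a, p_a)
```

Fraction reduces 2/4 to 1/2 automatically; the two values are equal.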



Suppose you roll one fair six-sided die, with the numbers {1, 2, 3, 4, 5, 6} on its faces. Let event E = rolling
a number that is at least 5. There are two outcomes {5, 6}. P(E) = 2/6. If you were to roll the die only a
few times, you would not be surprised if your observed results did not match the probability. If you were
to roll the die a very large number of times, you would expect that, overall, 2/6 of the rolls would result
in an outcome of "at least 5". The long-term relative frequency of obtaining this result would approach the
theoretical probability of 2/6 as the number of repetitions grows larger and larger.



2 This content is available online at <http://cnx.org/content/m18869/1.1/>.

This important characteristic of probability experiments is known as the Law of Large Numbers: as
the number of repetitions of an experiment is increased, the relative frequency obtained in the experiment 
tends to become closer and closer to the theoretical probability. Even though the outcomes don't happen 
according to any set pattern or order, overall, the long-term observed relative frequency will approach the 
theoretical probability. (The word empirical is often used instead of the word observed.) The Law of Large 
Numbers will be discussed again in Chapter 7. 
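
The Law of Large Numbers can be explored with a short simulation. The sketch below is an illustration only: it assumes Python's pseudo-random generator is an adequate stand-in for a fair die, and the function name and seed are arbitrary choices.

```python
import random

def relative_frequency_at_least_5(num_rolls, seed=0):
    """Simulate num_rolls rolls of a fair die; return the share that show 5 or 6."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) >= 5)
    return hits / num_rolls

# Compare the observed relative frequency with the theoretical 2/6 (about 0.333).
for n in (20, 2_000, 200_000):
    print(n, relative_frequency_at_least_5(n))
```

With only 20 rolls the observed value may be far from 2/6; with 200,000 rolls it should be very close.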

It is important to realize that in many situations, the outcomes are not equally likely. A coin or die may 
be unfair, or biased. Two math professors in Europe had their statistics students test the Belgian 1 Euro
coin and discovered that in 250 trials, a head was obtained 56% of the time and a tail was obtained 44% 
of the time. The data seem to show that the coin is not a fair coin; more repetitions would be helpful to 
draw a more accurate conclusion about such bias. Some dice may be biased. Look at the dice in a game you 
have at home; the spots on each face are usually small holes carved out and then painted to make the spots 
visible. Your dice may or may not be biased; it is possible that the outcomes may be affected by the slight 
weight differences due to the different numbers of holes in the faces. Gambling casinos have a lot of money 
depending on outcomes from rolling dice, so casino dice are made differently to eliminate bias. Casino dice 
have flat faces; the holes are completely filled with paint having the same density as the material that the 
dice are made out of so that each face is equally likely to occur. Later in this chapter we will learn techniques 
to use to work with probabilities for events that are not equally likely. 

"OR" Event: An outcome is in the event A OR B if the outcome is in A or is in B or is in both A and B. For 
example, let A = {1, 2, 3, 4, 5} and B = {4, 5, 6, 7, 8}. A OR B = {1, 2, 3, 4, 5, 6, 7, 8}. Notice that
4 and 5 are NOT listed twice. 

"AND" Event: An outcome is in the event A AND B if the outcome is in both A and B at the same time. 
For example, let A and B be {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8}, respectively. Then A AND B = {4,5}. 

The complement of event A is denoted A' (read "A prime"). A' consists of all outcomes that are NOT in A. 
Notice that P (A) + P (A') = 1. For example, let S = {1, 2, 3, 4, 5, 6} and let A = {1, 2, 3, 4}. Then, 
A' = {5, 6}. P(A) = 4/6, P(A') = 2/6, and P(A) + P(A') = 4/6 + 2/6 = 1.
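
The "OR", "AND", and complement definitions map directly onto set operations. The following Python sketch reuses the sets from the examples above; the names A2 and A2_complement are introduced here only to keep the complement example separate from the first two sets.

```python
from fractions import Fraction

A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
print(A | B)  # A OR B: outcomes in A, in B, or in both (4 and 5 appear once)
print(A & B)  # A AND B: outcomes in both A and B

# Complement example: S is the sample space and A2 plays the role of A.
S = {1, 2, 3, 4, 5, 6}
A2 = {1, 2, 3, 4}
A2_complement = S - A2  # all outcomes in S that are NOT in A2

p_a2 = Fraction(len(A2), len(S))
p_a2_complement = Fraction(len(A2_complement), len(S))
print(A2_complement, p_a2 + p_a2_complement)
```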

The conditional probability of A given B is written P(A | B). The probability of A is calculated knowing
that B has already occurred. A conditional reduces the sample space. We calculate the probability of A
from the reduced sample space B. The formula to calculate P(A | B) is

P(A | B) = P(A AND B) / P(B)

where P(B) is greater than 0.

For example, suppose we toss one fair, six-sided die. The sample space S = {1, 2, 3, 4, 5, 6}. Let A =
face is 2 or 3 and B = face is even (2, 4, 6). To calculate P(A | B), we count the number of outcomes 2 or 3 in
the sample space B = {2, 4, 6}. Then we divide that by the number of outcomes in B (and not S). Only the
outcome 2 qualifies, so P(A | B) = 1/3.

We get the same result by using the formula. Remember that S has 6 outcomes.

P(A | B) = P(A AND B) / P(B)
= [(the number of outcomes that are 2 or 3 and even in S) / 6] / [(the number of outcomes that are even in S) / 6]
= (1/6) / (3/6) = 1/3
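
Both routes to the conditional probability can be checked with a small Python sketch. The helper prob is a hypothetical name introduced for this illustration; it assumes all outcomes are equally likely.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 3}      # face is 2 or 3
B = {2, 4, 6}   # face is even

def prob(event, space):
    """Probability of event within space, assuming equally likely outcomes."""
    return Fraction(len(event & space), len(space))

by_counting = prob(A, B)                  # count inside the reduced sample space B
by_formula = prob(A & B, S) / prob(B, S)  # P(A AND B) / P(B)
print(by_counting, by_formula)
```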

Understanding Terminology and Symbols 

It is important to read each problem carefully to think about and understand what the events are. Under- 
standing the wording is the first very important step in solving probability problems. Reread the problem 
several times if necessary. Clearly identify the event of interest. Determine whether there is a condition 
stated in the wording that would indicate that the probability is conditional; carefully identify the condition,
if any.

Exercise 3.2.1 (Solution on p. 109.) 

In a particular college class, there are male and female students. Some students have long hair and 
some students have short hair. Write the symbols for the probabilities of the events for parts (a)
through (j) below. (Note that you can't find numerical answers here. You were not given enough 
information to find any probability values yet; concentrate on understanding the symbols.) 

• Let F be the event that a student is female. 

• Let M be the event that a student is male. 

• Let S be the event that a student has short hair. 

• Let L be the event that a student has long hair. 

a. The probability that a student does not have long hair. 

b. The probability that a student is male or has short hair. 

c. The probability that a student is a female and has long hair. 

d. The probability that a student is male, given that the student has long hair. 

e. The probability that a student has long hair, given that the student is male.

f. Of all the female students, the probability that a student has short hair.

g. Of all students with long hair, the probability that a student is female. 
h. The probability that a student is female or has long hair. 

i. The probability that a randomly selected student is a male student with short hair. 
j. The probability that a student is female. 



3.3 Independent and Mutually Exclusive Events (modified R. Bloom) 3 

3.3.1 Independent Events 

Two events are independent if the following are true: 

• P(A | B) = P(A)

• P(B | A) = P(B)

• P(A AND B) = P(A) • P(B)

If events A and B are independent, then the chance of A occurring does not affect the chance of B occurring 
and vice versa. 

Translating the symbols into words, the first two mathematical statements listed above say that the proba- 
bility for the event with the condition is the same as the probability for the event without the condition. For 
independent events: the condition does not change the probability for the event. 

For example, two rolls of a fair die are independent events. The outcome of the first roll does not change
the probability for the outcome of the second roll. 

If you select 2 cards consecutively from a complete deck of playing cards, the two selections are not in- 
dependent; the result of the first selection changes the remaining deck and affects the probabilities for the 
second selection. This is referred to as selecting "without replacement"; the first card has not been replaced 
into the deck before the second card is selected. 



3 This content is available online at <http://cnx.org/content/m18868/1.1/>.



However, suppose you were to select 2 cards "with replacement", by returning your first card to the deck 
and shuffling the deck before selecting the second card. Because the deck of cards is complete for both 
selections, the first selection does not affect the probability of the second selection. When selecting cards 
with replacement, the selections are independent. 

To show that two events are independent, you must show only one of the conditions listed above. If any 
one of these conditions is true, then all of them are true. 

3.3.2 Mutually Exclusive Events 

Events A and B are mutually exclusive events if they cannot occur at the same time. This means that A and 
B do not share any outcomes and P(A AND B) = 0. 

For example, suppose the sample space S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. 

Let A = {1, 2, 3, 4, 5}, B = {4, 5, 6, 7, 8}, and C = {7, 9}. 

A AND B = {4, 5}. P(A AND B) = 2/10 and is not equal to zero. Therefore, A and B are not mutually
exclusive.

A and C do not have any numbers in common so P(A AND C) = 0. Therefore, A and C are mutually
exclusive.
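
The check for mutually exclusive events can be sketched in Python. The helper names p_and and mutually_exclusive are hypothetical, introduced only for this illustration, and the outcomes in S are assumed to be equally likely.

```python
from fractions import Fraction

S = set(range(1, 11))   # sample space {1, 2, ..., 10}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7, 8}
C = {7, 9}

def p_and(X, Y):
    """P(X AND Y) for equally likely outcomes in S."""
    return Fraction(len(X & Y), len(S))

def mutually_exclusive(X, Y):
    """Two events are mutually exclusive when P(X AND Y) = 0."""
    return p_and(X, Y) == 0

print(p_and(A, B), mutually_exclusive(A, B))
print(p_and(A, C), mutually_exclusive(A, C))
```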

NOTE: Independent and mutually exclusive do not mean the same thing. 

You must show that any two events are independent or mutually exclusive. You cannot assume either of 
these conditions. 

If it is not known whether A and B are independent or dependent, assume they are dependent until you 
can show otherwise. 

The following examples illustrate these definitions and terms. 

Example 3.1 

Flip two fair coins. (This is an experiment.) 

The sample space is {HH, HT, TH, TT} where T = tails and H = heads. The outcomes are HH, 
HT, TH, and TT. The outcomes HT and TH are different. The HT means that the first coin 
showed heads and the second coin showed tails. The TH means that the first coin showed tails 
and the second coin showed heads. 

• Let A = the event of getting at most one tail. (At most one tail means 0 or 1 tail.) Then A can
be written as {HH, HT, TH}. The outcome HH shows 0 tails. HT and TH each show 1 tail.

• Let B = the event of getting all tails. B can be written as {TT}. B is the complement of A. So,
B = A'. Also, P(A) + P(B) = P(A) + P(A') = 1.

• The probabilities for A and for B are P(A) = 3/4 and P(B) = 1/4.

• Let C = the event of getting all heads. C = {HH}. Since B = {TT}, P(B AND C) = 0.
B and C are mutually exclusive. (B and C have no members in common because you cannot
have all tails and all heads at the same time.)

• Let D = event of getting more than one tail. D = {TT}. P(D) = 1/4.

• Let E = event of getting a head on the first flip. (This implies you can get either a head or tail
on the second flip.) E = {HT, HH}. P(E) = 2/4.

• Find the probability of getting at least one (1 or 2) tail in two flips. Let F = event of getting
at least one tail in two flips. F = {HT, TH, TT}. P(F) = 3/4.

Example 3.2 

Roll one fair 6-sided die. The sample space is {1, 2, 3, 4, 5, 6}. Let event A = a face is odd. Then
A = {1, 3, 5}. Let event B = a face is even. Then B = {2, 4, 6}.

• Find the complement of A, A'. The complement of A, A', is B because A and B together
make up the sample space. P(A) + P(B) = P(A) + P(A') = 1. Also, P(A) = 3/6 and P(B) = 3/6.

• Let event C = odd faces larger than 2. Then C = {3, 5}. Let event D = all even faces smaller
than 5. Then D = {2, 4}. P(C AND D) = 0 because you cannot have an odd and even face at
the same time. Therefore, C and D are mutually exclusive events.

• Let event E = all faces less than 5. E ={1,2,3,4}. 

Problem (Solution on p. 109.) 

Are C and E mutually exclusive events? (Answer yes or no.) Why or why not? 

• Find P(C | A). This is a conditional. Recall that the event C is {3, 5} and event A is {1, 3, 5}.
To find P(C | A), find the probability of C using the sample space A. You have reduced the
sample space from the original sample space {1, 2, 3, 4, 5, 6} to {1, 3, 5}. So, P(C | A) = 2/3.

Example 3.3 

Let event G = taking a math class. Let event H = taking a science class. Then, G AND H = taking 
a math class and a science class. Suppose P(G) = 0.6, P(H) = 0.5, and P(G AND H) = 0.3. Are 
G and H independent? 

To show that G and H are independent, you must show ONE of the following:

• P(G | H) = P(G)

• P(H | G) = P(H)

• P(G AND H) = P(G) • P(H)

NOTE: The choice you make depends on the information you have. You could choose any of the 
methods here because you have the necessary information. 

Problem 1 

Show that P(G | H) = P(G).

Solution

P(G | H) = P(G AND H) / P(H) = 0.3 / 0.5 = 0.6 = P(G)

Problem 2 

Show P(G AND H) = P(G) • P(H). 

Solution 

P(G) • P(H) = 0.6 • 0.5 = 0.3 = P(G AND H)

Interpretation of results 

Since G and H are independent, then, knowing that a person is taking a science class does not 
change the chance that he/she is taking math. (Note: IF the two events had not been independent 
- that is, IF they were dependent - then knowing that a person is taking a science class would 
change the chance he/she is taking math. The next example will illustrate two events that are not 
independent.) 

Example 3.4 

In a particular college class, 60% of the students are female. 50% of all students in the class have
long hair. 45% of the students are female and have long hair. Of the female students, 75% have 
long hair. Let F be the event that the student is female. Let L be the event that the student has long 
hair. Are the events of being female and having long hair independent? 

The following probabilities are given in this example: 
P(F) = 0.60; P(L) = 0.50
P(F AND L) = 0.45
P(L | F) = 0.75

If F and L are independent, then the following conditions are true. BUT if you show that any of 
the conditions are not true, then F and L are not independent. 

• P(L | F) = P(L)

• P(F | L) = P(F)

• P(F AND L) = P(F) • P(L)

NOTE: The choice you make depends on the information you have. You could use the first or
last condition on the list for this example. You do not know P(F | L) yet, so you cannot use the
second condition.

Solution 1 

Check whether P(F AND L) = P(F) • P(L): We are given that P(F AND L) = 0.45, but P(F) • P(L) =
(0.60)(0.50) = 0.30. The events of being female and having long hair are not independent because
P(F AND L) does not equal P(F) • P(L).

Solution 2 

Check whether P(L | F) equals P(L): We are given that P(L | F) = 0.75 but P(L) = 0.50; they are not
equal. The events of being female and having long hair are not independent.

Interpretation of results 

The events of being female and having long hair are not independent; knowing that a student is 
female changes the probability that a student has long hair. 
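
The product-rule check used in Examples 3.3 and 3.4 can be sketched as a small Python function. The name are_independent is hypothetical, and math.isclose is used here as an assumption about how to compare decimal probabilities that may carry rounding error.

```python
import math

def are_independent(p_a, p_b, p_a_and_b):
    """Check the condition P(A AND B) = P(A) * P(B), allowing for rounding error."""
    return math.isclose(p_a_and_b, p_a * p_b)

# Example 3.3: P(G) = 0.6, P(H) = 0.5, P(G AND H) = 0.3
print(are_independent(0.6, 0.5, 0.3))

# Example 3.4: P(F) = 0.60, P(L) = 0.50, P(F AND L) = 0.45
print(are_independent(0.60, 0.50, 0.45))
```

The first call reports that G and H are independent; the second reports that F and L are not.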

Example 3.5 

In a box there are 3 red cards and 5 blue cards. The red cards are marked with the numbers 1, 2, 
and 3, and the blue cards are marked with the numbers 1, 2, 3, 4, and 5. The cards are well-shuffled. 
You reach into the box (you cannot see into it) and draw one card. 

Let R = red card is drawn, B = blue card is drawn, E = even-numbered card is drawn. 

The sample space S = {R1, R2, R3, B1, B2, B3, B4, B5}. S has 8 outcomes.

• P(R) = 3/8. P(B) = 5/8. P(R AND B) = 0. (You cannot draw one card that is both red and blue.)

• P(E) = 3/8. (There are 3 even-numbered cards: R2, B2, and B4.)

• P(E | B) = 2/5. (There are 5 blue cards: B1, B2, B3, B4, and B5. Out of the blue cards, there are
2 even cards: B2 and B4.)

• P(B | E) = 2/3. (There are 3 even-numbered cards: R2, B2, and B4. Out of the even-numbered
cards, 2 are blue: B2 and B4.)

• The events R and B are mutually exclusive because P(R AND B) = 0.

• Let G = card with a number greater than 3. G = {B4, B5}. P(G) = 2/8. Let H = blue card
numbered between 1 and 4, inclusive. H = {B1, B2, B3, B4}. P(G | H) = 1/4. (The only card in
H that has a number greater than 3 is B4.) Since 2/8 = 1/4, P(G) = P(G | H), which means that
G and H are independent.
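
The probabilities in Example 3.5 can be verified by listing the eight cards and counting. The sketch below is one possible way to do this; the helper prob and the event functions are hypothetical names introduced for illustration.

```python
from fractions import Fraction

# Each card is a (color, number) pair: 3 red cards and 5 blue cards.
S = [("R", n) for n in (1, 2, 3)] + [("B", n) for n in (1, 2, 3, 4, 5)]

def prob(event, given=None):
    """Return P(event | given) by counting equally likely cards in S."""
    reduced = [card for card in S if given is None or given(card)]
    favorable = [card for card in reduced if event(card)]
    return Fraction(len(favorable), len(reduced))

def is_blue(card):
    return card[0] == "B"

def is_even(card):
    return card[1] % 2 == 0

def in_g(card):  # G: number greater than 3
    return card[1] > 3

def in_h(card):  # H: blue card numbered 1 through 4
    return card[0] == "B" and 1 <= card[1] <= 4

print(prob(is_even), prob(is_even, is_blue), prob(is_blue, is_even))
print(prob(in_g) == prob(in_g, in_h))  # equal, so G and H are independent
```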




3.4 Two Basic Rules of Probability 4 

3.4.1 The Multiplication Rule 

If A and B are two events defined on a sample space, then: P(A AND B) = P(B) • P(A | B).

This rule may also be written as: P(A | B) = P(A AND B) / P(B)

(The probability of A given B equals the probability of A and B divided by the probability of B.)

If A and B are independent, then P(A | B) = P(A). Then P(A AND B) = P(A | B) • P(B) becomes
P(A AND B) = P(A) • P(B).

3.4.2 The Addition Rule 

If A and B are defined on a sample space, then: P(A OR B) = P(A) + P(B) - P(A AND B). 

If A and B are mutually exclusive, then P(A AND B) = 0. Then P(A OR B) = P(A) + P(B) - P(A AND B) 
becomes P(A OR B) = P(A) + P(B). 

Example 3.6 

Klaus is trying to choose where to go on vacation. His two choices are: A = New Zealand and B
= Alaska.

• Klaus can only afford one vacation. The probability that he chooses A is P(A) = 0.6 and the
probability that he chooses B is P(B) = 0.35.

• P(A AND B) = 0 because Klaus can only afford to take one vacation.

• Therefore, the probability that he chooses either New Zealand or Alaska is P(A OR B) = 
P(A) + P(B) = 0.6 + 0.35 = 0.95. Note that the probability that he does not choose to go 
anywhere on vacation must be 0.05. 

Example 3.7 

Carlos plays college soccer. He makes a goal 65% of the time he shoots. Carlos is going to attempt
two goals in a row in the next game.

A = the event Carlos is successful on his first attempt. P(A) = 0.65. B = the event Carlos is 
successful on his second attempt. P(B) = 0.65. Carlos tends to shoot in streaks. The probability 
that he makes the second goal GIVEN that he made the first goal is 0.90. 

Problem 1 

What is the probability that he makes both goals? 

Solution 

The problem is asking you to find P(A AND B) = P(B AND A). Since P(B | A) = 0.90:

P(B AND A) = P(B | A) • P(A) = 0.90 • 0.65 = 0.585 (3.1)

Carlos makes the first and second goals with probability 0.585. 

Problem 2 

What is the probability that Carlos makes either the first goal or the second goal? 



4 This content is available online at <http://cnx.org/content/m16847/1.11/>.

Solution

The problem is asking you to find P(A OR B). 

P(A OR B) = P(A) + P(B) - P(A AND B) = 0.65 + 0.65 - 0.585 = 0.715 (3.2) 

Carlos makes either the first goal or the second goal with probability 0.715. 

Problem 3 

Are A and B independent? 

Solution 

No, they are not, because P(B AND A) = 0.585.

P(B) • P(A) = (0.65) • (0.65) = 0.423 (3.3)

0.423 ≠ 0.585 = P(B AND A) (3.4)

So, P(B AND A) is not equal to P(B) • P(A).

Problem 4 

Are A and B mutually exclusive? 

Solution 

No, they are not because P(A and B) = 0.585. 

To be mutually exclusive, P(A AND B) must equal 0. 
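
The four answers in Example 3.7 follow from the Multiplication Rule and the Addition Rule. A minimal Python sketch of the arithmetic is shown below; the variable names are arbitrary, and math.isclose is used because decimal values are stored with small rounding errors.

```python
import math

p_a = 0.65           # P(A): Carlos makes the first goal
p_b = 0.65           # P(B): Carlos makes the second goal
p_b_given_a = 0.90   # P(B | A)

p_a_and_b = p_b_given_a * p_a      # Multiplication Rule
p_a_or_b = p_a + p_b - p_a_and_b   # Addition Rule

independent = math.isclose(p_a_and_b, p_a * p_b)
mutually_exclusive = math.isclose(p_a_and_b, 0.0, abs_tol=1e-12)

print(round(p_a_and_b, 3), round(p_a_or_b, 3), independent, mutually_exclusive)
```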



Example 3.8 

A community swim team has 150 members. Seventy-five of the members are advanced swim- 
mers. Forty-seven of the members are intermediate swimmers. The remainder are novice swim- 
mers. Forty of the advanced swimmers practice 4 times a week. Thirty of the intermediate swim- 
mers practice 4 times a week. Ten of the novice swimmers practice 4 times a week. Suppose one 
member of the swim team is randomly chosen. Answer the questions (Verify the answers): 

Problem 1 

What is the probability that the member is a novice swimmer? 

Solution 

28/150

Problem 2 

What is the probability that the member practices 4 times a week? 

Solution 

80/150

Problem 3 

What is the probability that the member is an advanced swimmer and practices 4 times a week? 




Solution 

40/150

Problem 4 

What is the probability that a member is an advanced swimmer and an intermediate swimmer? 
Are being an advanced swimmer and an intermediate swimmer mutually exclusive? Why or why 
not? 

Solution 

P(advanced AND intermediate) = 0, so these are mutually exclusive events. A swimmer cannot 
be an advanced swimmer and an intermediate swimmer at the same time. 



Problem 5 

Are being a novice swimmer and practicing 4 times a week independent events? Why or why 
not? 

Solution 

No, these are not independent events. 

P(novice AND practices 4 times per week) = 0.0667 (3.5) 

P(novice) • P(practices 4 times per week) = 0.0996 (3.6) 

0.0667 ≠ 0.0996 (3.7)
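
The answers to Example 3.8 can be verified with a short Python sketch. The dictionary layout and variable names are assumptions made for this illustration; the counts come from the example.

```python
from fractions import Fraction

total = 150
counts = {"advanced": 75, "intermediate": 47}
counts["novice"] = total - sum(counts.values())  # the remainder are novices
practice_4x = {"advanced": 40, "intermediate": 30, "novice": 10}

p_novice = Fraction(counts["novice"], total)
p_practice = Fraction(sum(practice_4x.values()), total)
p_novice_and_practice = Fraction(practice_4x["novice"], total)

# Independence would require P(novice AND practices) = P(novice) * P(practices).
print(round(float(p_novice_and_practice), 4))
print(round(float(p_novice * p_practice), 4))
print(p_novice_and_practice == p_novice * p_practice)
```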



Example 3.9 

Studies show that, if she lives to be 90, about 1 woman in 7 (approximately 14.3%) will develop 
breast cancer. Suppose that of those women who develop breast cancer, a test is negative 2% of the 
time. Also suppose that in the general population of women, the test for breast cancer is negative 
about 85% of the time. Let B = woman develops breast cancer and let N = tests negative. Suppose 
one woman is selected at random. 

Problem 1 

What is the probability that the woman develops breast cancer? What is the probability that
the woman tests negative?

Solution 

P(B) = 0.143 ; P(N) = 0.85 



Problem 2 

Given that the woman has breast cancer, what is the probability that she tests negative? 

Solution 

P(N | B) = 0.02

Problem 3 

What is the probability that the woman has breast cancer AND tests negative? 

Solution

P(B AND N) = P(B) • P(N | B) = (0.143) • (0.02) = 0.0029

Problem 4 

What is the probability that the woman has breast cancer or tests negative? 

Solution 

P(B OR N) = P(B) + P(N) - P(B AND N) = 0.143 + 0.85 - 0.0029 = 0.9901 

Problem 5 

Are having breast cancer and testing negative independent events? 

Solution 

No. P(N) = 0.85; P(N | B) = 0.02. So, P(N | B) does not equal P(N).

Problem 6 

Are having breast cancer and testing negative mutually exclusive? 

Solution 

No. P(B AND N) = 0.0029. For B and N to be mutually exclusive, P(B AND N) must be 0. 



3.5 Contingency Tables (modified R. Bloom) 5 

A contingency table provides a different way of calculating probabilities. The table helps in determining 
conditional probabilities quite easily. The table displays sample values in relation to two different variables 
that may be dependent or contingent on one another. Later on, we will use contingency tables again, but in 
another manner. 

Example 3.10 

Suppose a study of speeding violations and drivers who use car phones produced the following 
fictional data: 





                        Speeding violation    No speeding violation    Total
                        in the last year      in the last year
Car phone user                  25                    280               305
Not a car phone user            45                    405               450
Total                           70                    685               755

Table 3.1

The total number of people in the sample is 755. The row totals are 305 and 450. The column totals 
are 70 and 685. Notice that 305 + 450 = 755 and 70 + 685 = 755. 



5 This content is available online at <http://cnx.org/content/m18859/1.1/>.

Calculate the following probabilities using the table

Problem 1 

P(person is a car phone user) = 

Solution 

(number of car phone users) / (total number in study) = 305/755

Problem 2 

P(person had no violation in the last year) = 

Solution 

(number that had no violation) / (total number in study) = 685/755

Problem 3 

P(person had no violation in the last year AND was a car phone user) = 

Solution 

280/755

Problem 4 

P(person is a car phone user OR person had no violation in the last year) 

Solution 

(305/755 + 685/755) - 280/755 = 710/755



Problem 5 

P(person is a car phone user GIVEN person had a violation in the last year) = 

Solution 

25/70 (The sample space is reduced to the number of persons who had a violation.)



Problem 6 

P(person had no violation last year GIVEN person was not a car phone user) = 

Solution 

405/450 (The sample space is reduced to the number of persons who were not car phone users.)
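
The six answers for Example 3.10 can be reproduced from the four interior cells of Table 3.1. The Python sketch below is one way to organize the counts; the dictionary keys and the helper count are hypothetical names chosen for this illustration.

```python
from fractions import Fraction

# Interior cells of Table 3.1, keyed by (row, column).
table = {
    ("user", "violation"): 25,
    ("user", "no violation"): 280,
    ("non-user", "violation"): 45,
    ("non-user", "no violation"): 405,
}
total = sum(table.values())

def count(row=None, col=None):
    """Sum the cells matching the given row and/or column (None matches all)."""
    return sum(n for (r, c), n in table.items()
               if (row is None or r == row) and (col is None or c == col))

p_user = Fraction(count(row="user"), total)
p_no_violation = Fraction(count(col="no violation"), total)
p_both = Fraction(count("user", "no violation"), total)
p_either = p_user + p_no_violation - p_both  # Addition Rule

# Conditional probabilities divide by a row or column total, not the grand total.
p_user_given_violation = Fraction(count("user", "violation"), count(col="violation"))
p_no_violation_given_non_user = Fraction(count("non-user", "no violation"),
                                         count(row="non-user"))

print(p_user, p_no_violation, p_both, p_either)
print(p_user_given_violation, p_no_violation_given_non_user)
```

Fraction prints each result in lowest terms, so 305/755 appears as 61/151; the values are equal.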



Example 3.11 

The following table shows a random sample of 100 hikers and the areas of hiking preferred: 



                     Hiking Area Preference

Sex       The Coastline    Near Lakes and Streams    On Mountain Peaks    Total
Female         18                    16                     ___             45
Male           ___                   ___                    14              55
Total          ___                   41                     ___             ___

Table 3.2

Problem 1 (Solution on p. 109.) 

Complete the table. 

Problem 2 (Solution on p. 109.) 

Are the events "being female" and "preferring the coastline" independent events? 

Let F = being female and let C = preferring the coastline. 

a. P(F AND C) = 

b. P(F) • P(C) =

Are these two numbers the same? If they are, then F and C are independent. If they are not, then 
F and C are not independent. 

Problem 3 (Solution on p. 109.) 

Find the probability that a person is male given that the person prefers hiking near lakes and 
streams. Let M = being male and let L = prefers hiking near lakes and streams. 

a. What word tells you this is a conditional? 

b. Fill in the blanks and calculate the probability: P(___ | ___) = ___.

c. Is the sample space for this problem all 100 hikers? If not, what is it? 

Problem 4 (Solution on p. 109.) 

Find the probability that a person is female or prefers hiking on mountain peaks. Let F = being 
female and let P = prefers mountain peaks. 

a. P(F) = 

b. P(P) = 

c. P(F AND P) =

d. Therefore, P(F OR P) = 



3.6 Venn Diagrams (optional) 6 

A Venn diagram is a picture that represents the outcomes of an experiment. It generally consists of a box 
that represents the sample space S together with circles or ovals. The circles or ovals represent events. 

Example 3.12 

Suppose an experiment has the outcomes 1, 2, 3, ... , 12 where each outcome has an equal chance 
of occurring. Let event A = {1, 2, 3, 4, 5, 6} and event B = {6, 7, 8, 9}. Then A AND B = {6}
and A OR B = {1, 2, 3, 4, 5, 6, 7, 8, 9}. The Venn diagram is as follows:



6 This content is available online at <http://cnx.org/content/m16848/1.12/>.

[Figure: Venn diagram showing the sample space as a box containing overlapping circles A and B. Outcome 6 lies in the overlap; outcomes 10, 11, and 12 lie outside both circles.]



Example 3.13 

Flip 2 fair coins. Let A = tails on the first coin. Let B = tails on the second coin. Then A = {TT, TH}
and B = {TT, HT}. Therefore, A AND B = {TT}. A OR B = {TH, TT, HT}.

The sample space when you flip two fair coins is S = {HH, HT, TH, TT}. The outcome HH is in 
neither A nor B. The Venn diagram is as follows: 



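[Figure: Venn diagram of the sample space S with overlapping ovals A and B. TH lies only in A, HT lies only in B, TT lies in the overlap, and HH lies outside both ovals.]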



Example 3.14 

Forty percent of the students at a local college belong to a club and 50% work part time. Five 
percent of the students work part time and belong to a club. Draw a Venn diagram showing the 
relationships. Let C = student belongs to a club and PT = student works part time. 




If a student is selected at random, find:

• The probability that the student belongs to a club. P(C) = 0.40.

• The probability that the student works part time. P(PT) = 0.50.

• The probability that the student belongs to a club AND works part time. P(C AND PT) = 0.05.

• The probability that the student belongs to a club given that the student works part time.

P(C | PT) = P(C AND PT) / P(PT) = 0.05 / 0.50 = 0.1 (3.8)

• The probability that the student belongs to a club OR works part time.

P(C OR PT) = P(C) + P(PT) - P(C AND PT) = 0.40 + 0.50 - 0.05 = 0.85 (3.9)
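The two calculations in Example 3.14 can be checked numerically. This is a small sketch using only the three probabilities given in the example; the variable names are assumptions made for illustration.

```python
# Probabilities given in Example 3.14.
p_c = 0.40          # P(C): student belongs to a club
p_pt = 0.50         # P(PT): student works part time
p_c_and_pt = 0.05   # P(C AND PT)

# Conditional probability: restrict the sample space to part-time workers.
p_c_given_pt = p_c_and_pt / p_pt

# Addition rule: subtract the overlap so it is not counted twice.
p_c_or_pt = p_c + p_pt - p_c_and_pt

print(round(p_c_given_pt, 4), round(p_c_or_pt, 4))
```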



3.7 Tree Diagrams 7 

A tree diagram is a special type of graph used to determine the outcomes of an experiment. It consists of
"branches" that are labeled with either frequencies or probabilities. Tree diagrams can make some
probability problems easier to visualize and solve. The following example illustrates how to use a tree diagram.

Example 3.15 

In an urn, there are 11 balls. Three balls are red (R) and 8 balls are blue (B). Draw two balls, one
at a time, with replacement. "With replacement" means that you put the first ball back in the urn
before you select the second ball. The tree diagram using frequencies, which shows all the possible
outcomes, follows.





[Tree diagram using frequencies. First draw: 3R or 8B. Second draw, from each first-draw branch: 3R or 8B. Branch ends: 9RR, 24RB, 24BR, 64BB.]



Figure 3.1: Total = 64 + 24 + 24 + 9 = 121 



The first set of branches represents the first draw. The second set of branches represents the second
draw. Each of the outcomes is distinct. In fact, we can list each red ball as R1, R2, and R3 and each
blue ball as B1, B2, B3, B4, B5, B6, B7, and B8. Then the 9 RR outcomes can be written as:



7 This content is available online at <http://cnx.org/content/m16846/1.10/>.
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R1R1; R1R2; R1R3; R2R1; R2R2; R2R3; R3R1; R3R2; R3R3 

The other outcomes are similar. 

There are a total of 11 balls in the urn. Draw two balls, one at a time, and with replacement. There 
are 11 • 11 = 121 outcomes, the size of the sample space. 

Problem 1 (Solution on p. 109.) 

List the 24 BR outcomes: B1R1, B1R2, B1R3, ... 

Problem 2 

Using the tree diagram, calculate P(RR). 

Solution 

P(RR) = (3/11)(3/11) = 9/121

Problem 3 

Using the tree diagram, calculate P(RB OR BR). 

Solution 

P(RB OR BR) = (3/11)(8/11) + (8/11)(3/11) = 48/121

Problem 4 

Using the tree diagram, calculate P(R on 1st draw AND B on 2nd draw). 

Solution 

P(R on 1st draw AND B on 2nd draw) = P(RB) = (3/11)(8/11) = 24/121

Problem 5 

Using the tree diagram, calculate P(R on 2nd draw given B on 1st draw). 

Solution 

P(R on 2nd draw given B on 1st draw) = P(R on 2nd | B on 1st) = 24/88 = 3/11

This problem is a conditional. The sample space has been reduced to those outcomes that already
have a blue on the first draw. There are 24 + 64 = 88 possible outcomes (24 BR and 64 BB).
Twenty-four of the 88 possible outcomes are BR. 24/88 = 3/11.

Problem 6 (Solution on p. 109.) 

Using the tree diagram, calculate P(BB). 

Problem 7 (Solution on p. 110.) 

Using the tree diagram, calculate P(B on the 2nd draw given R on the first draw). 
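The frequencies on the tree in Example 3.15 can be reproduced by listing every ordered pair of draws. The sketch below is one way to do that, assuming the ball labels R1 to R3 and B1 to B8 used in the example; the helper name `color` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Label the 3 red and 8 blue balls individually, as in the example.
balls = [f"R{i}" for i in range(1, 4)] + [f"B{i}" for i in range(1, 9)]

# Drawing with replacement: every ordered pair of balls is possible.
outcomes = list(product(balls, repeat=2))


def color(ball):
    """Return 'R' or 'B' for a labeled ball such as 'R2'."""
    return ball[0]


# Count the outcomes at the end of each branch (RR, RB, BR, BB).
counts = {}
for first, second in outcomes:
    key = color(first) + color(second)
    counts[key] = counts.get(key, 0) + 1

p_rr = Fraction(counts["RR"], len(outcomes))
p_rb_or_br = Fraction(counts["RB"] + counts["BR"], len(outcomes))
# Conditional: the sample space shrinks to outcomes with B on the first draw.
p_r2_given_b1 = Fraction(counts["BR"], counts["BR"] + counts["BB"])

print(len(outcomes), counts, p_rr, p_rb_or_br, p_r2_given_b1)
```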

Example 3.16 

An urn has 3 red marbles and 8 blue marbles in it. Draw two marbles, one at a time, this time
without replacement from the urn. "Without replacement" means that you do not put the first
ball back before you select the second ball. Below is a tree diagram. The branches are labeled with
probabilities instead of frequencies. The numbers at the ends of the branches are calculated by
multiplying the numbers on the two corresponding branches, for example, (3/11)(2/10) = 6/110.

[Tree diagram using probabilities. First draw: R with probability 3/11, B with probability 8/11. Second draw after R: R with probability 2/10, B with probability 8/10. Second draw after B: R with probability 3/10, B with probability 7/10. Branch ends: RR = 6/110, RB = 24/110, BR = 24/110, BB = 56/110.]

Figure 3.2: Total = (56 + 24 + 24 + 6)/110 = 110/110 = 1



NOTE: If you draw a red on the first draw from the 3 red possibilities, there are 2 red left to draw 
on the second draw. You do not put back or replace the first ball after you have drawn it. You 
draw without replacement, so that on the second draw there are 10 marbles left in the urn. 

Calculate the following probabilities using the tree diagram. 

Problem 1

P(RR) =

Solution

P(RR) = (3/11)(2/10) = 6/110

Problem 2 (Solution on p. 110.)

Fill in the blanks:

P(RB OR BR) = (3/11)(8/10) + (___)(___) = 48/110

Problem 3 (Solution on p. 110.)

P(R on 2nd | B on 1st) =

Problem 4 (Solution on p. 110.)

Fill in the blanks:

P(R on 1st AND B on 2nd) = P(RB) = (___)(___) = 24/110

Problem 5 (Solution on p. 110.)

P(BB) =

Problem 6

P(B on 2nd | R on 1st) =
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Solution 

There are 6 + 24 outcomes that have R on the first draw (6 RR and 24 RB). The 6 and the 24
are frequencies. They are also the numerators of the fractions 6/110 and 24/110. The sample space is no
longer 110 but 6 + 24 = 30. Twenty-four of the 30 outcomes have B on the second draw. The
probability is then 24/30. Did you get this answer?



If we are using probabilities, we can label the tree in the following general way. 




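[Figure: general tree diagram labeled with probabilities. The first-draw branches are labeled P(B) and P(R); the second-draw branches are labeled P(B | B), P(R | B), P(B | R), and P(R | R). Each branch end is the product of its two branch labels, for example P(B)P(B | B) = P(BB) and P(R)P(R | R) = P(RR).]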



• P(R | R) here means P(R on 2nd | R on 1st)

• P(B | R) here means P(B on 2nd | R on 1st)

• P(R | B) here means P(R on 2nd | B on 1st)

• P(B | B) here means P(B on 2nd | B on 1st)
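The labeling rule for a probability tree (multiply the first-draw probability by the conditional second-draw probability) can be sketched as follows. This is an illustrative helper for drawing without replacement, as in Example 3.16; the function name `two_draw_tree` is an assumption, not part of the text.

```python
from fractions import Fraction


def two_draw_tree(red, blue):
    """Return P(first AND second) for two draws without replacement."""
    total = red + blue
    tree = {}
    for first, n_first in (("R", red), ("B", blue)):
        p_first = Fraction(n_first, total)
        # One ball of the first color is gone before the second draw.
        remaining = {"R": red, "B": blue}
        remaining[first] -= 1
        for second in ("R", "B"):
            p_second_given_first = Fraction(remaining[second], total - 1)
            tree[first + second] = p_first * p_second_given_first
    return tree


tree = two_draw_tree(red=3, blue=8)

# Conditional probability read back from the branch ends.
p_b2_given_r1 = tree["RB"] / (tree["RR"] + tree["RB"])

print(tree, p_b2_given_r1)
```

Fractions are reduced automatically, so 6/110 prints as 3/55; the values are the same as the ones on the tree.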
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3.8 Summary of Formulas 8 



Formula 3.1: Complement

If A and A' are complements then P(A) + P(A') = 1

Formula 3.2: Addition Rule

P(A OR B) = P(A) + P(B) - P(A AND B)

Formula 3.3: Mutually Exclusive

If A and B are mutually exclusive then P(A AND B) = 0; so P(A OR B) = P(A) + P(B).

Formula 3.4: Multiplication Rule

• P(A AND B) = P(B)P(A | B)

• P(A AND B) = P(A)P(B | A)

Formula 3.5: Independence

If A and B are independent then:

• P(A | B) = P(A)

• P(B | A) = P(B)

• P(A AND B) = P(A)P(B)
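As a quick check of these formulas, the sketch below applies them to the two-coin experiment of Example 3.13 (A = tails on the first coin, B = tails on the second coin). The helper `prob` is hypothetical and assumes equally likely outcomes.

```python
from fractions import Fraction

# Sample space for flipping two fair coins (Example 3.13).
S = ["HH", "HT", "TH", "TT"]


def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))


A = {s for s in S if s[0] == "T"}   # tails on the first coin
B = {s for s in S if s[1] == "T"}   # tails on the second coin

complement_holds = prob(A) + prob(set(S) - A) == 1                    # Formula 3.1
addition_holds = prob(A | B) == prob(A) + prob(B) - prob(A & B)       # Formula 3.2
p_a_given_b = prob(A & B) / prob(B)                                   # from Formula 3.4
independent = prob(A & B) == prob(A) * prob(B)                        # Formula 3.5

print(complement_holds, addition_holds, p_a_given_b, independent)
```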



8 This content is available online at <http://cnx.org/content/m16843/1.5/>.
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3.9 Practice 1: Contingency Tables (modified R. Bloom) 9 

3.9.1 Student Learning Objectives 

• The student will practice constructing and interpreting contingency tables. 

3.9.2 Given 

An article in the New England Journal of Medicine (by Haiman, Stram, Wilkens, Pike, et al., 1/26/06)
reported on a study of smokers in California and Hawaii. In one part of the report, the self-reported
ethnicity and smoking levels per day were given. Of the people smoking at most 10 cigarettes per day 
there were 9886 African Americans, 2745 Native Hawaiians, 12,831 Latinos, 8378 Japanese Americans, and 
7650 Whites. Of the people smoking 11-20 cigarettes per day, there were 6514 African Americans, 3062 
Native Hawaiians, 4932 Latinos, 10,680 Japanese Americans, and 9877 Whites. Of the people smoking 
21-30 cigarettes per day, there were 1671 African Americans, 1419 Native Hawaiians, 1406 Latinos, 4715 
Japanese Americans, and 6062 Whites. Of the people smoking at least 31 cigarettes per day, there were 759 
African Americans, 788 Native Hawaiians, 800 Latinos, 2305 Japanese Americans, and 3970 Whites. 

3.9.3 Complete the Table 

Verify that the data above is entered into the table correctly. 

Smoking Levels by Ethnicity

Smoking Level   African American   Native Hawaiian   Latino   Japanese Americans   White   TOTALS
1-10                        9886              2745    12831                 8378    7650    41490
11-20                       6514              3062     4932                10680    9877    35065
21-30                       1671              1419     1406                 4715    6062    15273
31+                          759               788      800                 2305    3970     8622
TOTALS                     18830              8014    19969                26078   27559   100450

Table 3.3



3.9.4 Analyze the Data 

Suppose that one person from the study is randomly selected. 

Exercise 3.9.1 (Solution on p. 110.)

Find the probability that the person smoked 11-20 cigarettes per day.

Exercise 3.9.2 (Solution on p. 110.)

Find the probability that the person was Latino.



9 This content is available online at <http://cnx.org/content/m18926/1.1/>.
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3.9.5 Discussion Questions 

Exercise 3.9.3 (Solution on p. 110.) 

In words, explain what it means to pick one person from the study and that person is "Japanese 
American AND smokes 21-30 cigarettes per day." Also, find the probability. 

Exercise 3.9.4 (Solution on p. 110.) 

In words, explain what it means to pick one person from the study and that person is "Japanese 
American OR smokes 21-30 cigarettes per day." Also, find the probability. 

Exercise 3.9.5 (Solution on p. 110.) 

In words, explain what it means to pick one person from the study and that person is "Japanese 
American GIVEN that person smokes 21-30 cigarettes per day." Also, find the probability. 

Exercise 3.9.6 

Prove that smoking level per day and ethnicity are dependent events.
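The AND, OR, and GIVEN probabilities in this practice all come from the same contingency table. The following sketch stores Table 3.3 as a dictionary and computes them for "Japanese American" and "smokes 21-30 cigarettes per day"; the data layout and variable names are assumptions chosen for illustration.

```python
from fractions import Fraction

groups = ["African American", "Native Hawaiian", "Latino", "Japanese American", "White"]
counts = {  # rows: smoking level; columns: in the order of `groups`
    "1-10":  [9886, 2745, 12831, 8378, 7650],
    "11-20": [6514, 3062, 4932, 10680, 9877],
    "21-30": [1671, 1419, 1406, 4715, 6062],
    "31+":   [759, 788, 800, 2305, 3970],
}

grand_total = sum(sum(row) for row in counts.values())
row_total = {level: sum(row) for level, row in counts.items()}
col_total = {g: sum(row[i] for row in counts.values()) for i, g in enumerate(groups)}

ja = groups.index("Japanese American")
n_both = counts["21-30"][ja]

# AND: one cell of the table over the grand total.
p_and = Fraction(n_both, grand_total)
# OR: column total plus row total, minus the cell counted twice.
p_or = Fraction(col_total["Japanese American"] + row_total["21-30"] - n_both, grand_total)
# GIVEN: the sample space shrinks to the 21-30 row.
p_given = Fraction(n_both, row_total["21-30"])
# Product of the marginals, for comparison with P(AND).
p_product = (Fraction(col_total["Japanese American"], grand_total)
             * Fraction(row_total["21-30"], grand_total))

print(float(p_and), float(p_or), float(p_given), float(p_product))
```

If the two events were independent, `p_and` would equal `p_product`; comparing the two is one way to approach the last exercise.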
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3.10 Practice 2: Calculating Probabilities 10 

3.10.1 Student Learning Outcomes 

• Students will define basic probability terms. 

• Students will calculate probabilities. 

• Students will determine whether two events are mutually exclusive or whether two events are
independent.

NOTE: Use probability rules to solve the problems below. Show your work. 

3.10.2 Given 

48% of all California registered voters prefer life in prison without parole over the death penalty for
a person convicted of first degree murder. Among Latino California registered voters, 55% prefer life 
in prison without parole over the death penalty for a person convicted of first degree murder. (Source: 
http://field.com/fieldpollonline/subscribers/Rls2393.pdf ). 
37.6% of all Californians are Latino (Source: U.S. Census Bureau). 

In this problem, let: 

• C = Californians (registered voters) preferring life in prison without parole over the death penalty for a person convicted of first degree murder

• L = Latino Californians 

Suppose that one Californian is randomly selected. 

3.10.3 Analyze the Data 

Exercise 3.10.1 (Solution on p. 110.) 

P(C) = 
Exercise 3.10.2 (Solution on p. 110.) 

P(L) = 

Exercise 3.10.3 (Solution on p. 110.) 

P(C | L) =

Exercise 3.10.4 

In words, what is " C | L"? 

Exercise 3.10.5 (Solution on p. 110.) 

P (L AND C) = 

Exercise 3.10.6 

In words, what is "L and C"? 

Exercise 3.10.7 (Solution on p. 110.) 

Are L and C independent events? Show why or why not. 

Exercise 3.10.8 (Solution on p. 110.) 

P (L OR C) = 

Exercise 3.10.9 

In words, what is "L or C"? 

Exercise 3.10.10 (Solution on p. 110.) 

Are L and C mutually exclusive events? Show why or why not. 
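The probability rules needed for this practice can be chained together in a few lines. This is a sketch under the assumption that the three given percentages are read as P(C) = 0.48, P(L) = 0.376, and P(C | L) = 0.55; variable names are illustrative.

```python
# Values read from the "Given" section.
p_c = 0.48           # P(C)
p_l = 0.376          # P(L)
p_c_given_l = 0.55   # P(C | L)

# Multiplication rule: P(L AND C) = P(L) * P(C | L).
p_l_and_c = p_l * p_c_given_l

# Addition rule: P(L OR C) = P(L) + P(C) - P(L AND C).
p_l_or_c = p_l + p_c - p_l_and_c

# Independent events would satisfy P(C | L) = P(C).
independent = abs(p_c_given_l - p_c) < 1e-9

# Mutually exclusive events would satisfy P(L AND C) = 0.
mutually_exclusive = abs(p_l_and_c) < 1e-9

print(round(p_l_and_c, 4), round(p_l_or_c, 4), independent, mutually_exclusive)
```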



10 This content is available online at <http://cnx.org/content/m16840/1.12/>.
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3.11 Homework Link 11 

Link to homework questions in Homework Collection/Book for Probability Topics Chapter 3 of
Collaborative Statistics for R. Bloom

Chapter 3 Homework Problems ( http://cnx.org/content/m18924/latest/ ) 12



11 This content is available online at <http://cnx.org/content/m19053/1.1/>.
12 http://cnx.org/content/m18924/latest/
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3.12 Review Questions Link 13 

Link to Review Questions in Homework Collection/Book for Probability Topics Chapter 3 of Collaborative
Statistics for R. Bloom

Chapter 3 Review Questions ( http://cnx.org/content/m19023/latest/ ) 14



13 This content is available online at <http://cnx.org/content/m19032/1.1/>.
14 http://cnx.org/content/m19023/latest/
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Solutions to Exercises in Chapter 3 

Solution to Exercise 3.2.1 (p. 88) 

a. P(L') = P(S)

b. P(M or S)

c. P(F and L)

d. P(M | L)

e. P(L | M)

f. P(S | F)

g. P(F | L)

h. P(F or L)

i. P(M and S)

j. P(F)

Solution to Example 3.2, Problem (p. 90)

No. C = {3, 5} and E = {1, 2, 3, 4}, so C AND E = {3} and P(C AND E) is not 0. To be mutually
exclusive, P(C AND E) must be 0.

Solution to Example 3.11, Problem 1 (p. 97)

Hiking Area Preference

Sex      The Coastline   Near Lakes and Streams   On Mountain Peaks   Total
Female   18              16                       11                  45
Male     16              25                       14                  55
Total    34              41                       25                  100

Table 3.4



Solution to Example 3.11, Problem 2 (p. 97) 



a. P(F AND C) = 18/100 = 0.18

b. P(F) · P(C) = (45/100)(34/100) = 0.45 · 0.34 = 0.153

P(F AND C) ≠ P(F) · P(C), so the events F and C are not independent.
Solution to Example 3.11, Problem 3 (p. 97) 

a. The word 'given' tells you that this is a conditional.

b. P(M | L) = 25/41

c. No, the sample space for this problem is the 41 hikers who prefer lakes and streams.

Solution to Example 3.11, Problem 4 (p. 97) 



a. P(F) = 45/100

b. P(P) = 25/100

c. P(F AND P) = 11/100

d. P(F OR P) = 45/100 + 25/100 - 11/100 = 59/100



Solution to Example 3.15, Problem 1 (p. 100) 

B1R1; B1R2; B1R3; B2R1; B2R2; B2R3; B3R1; B3R2; B3R3; B4R1; B4R2; B4R3; B5R1; B5R2; B5R3; B6R1;
B6R2; B6R3; B7R1; B7R2; B7R3; B8R1; B8R2; B8R3
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Solution to Example 3.15, Problem 6 (p. 100) 

P(BB) = 64/121

Solution to Example 3.15, Problem 7 (p. 100)

P(B on 2nd draw | R on 1st draw) = 24/33 = 8/11

There are 9 + 24 outcomes that have R on the first draw (9 RR and 24 RB). The sample space is then
9 + 24 = 33. Twenty-four of the 33 outcomes have B on the second draw. The probability is then 24/33.

Solution to Example 3.16, Problem 2 (p. 101)

P(RB OR BR) = (3/11)(8/10) + (8/11)(3/10) = 48/110

Solution to Example 3.16, Problem 3 (p. 101)

P(R on 2nd | B on 1st) = 3/10

Solution to Example 3.16, Problem 4 (p. 101)

P(R on 1st AND B on 2nd) = P(RB) = (3/11)(8/10) = 24/110

Solution to Example 3.16, Problem 5 (p. 101)

P(BB) = (8/11)(7/10)

Solutions to Practice 1: Contingency Tables (modified R. Bloom) 

Solution to Exercise 3.9.1 (p. 104) 

35,065/100,450

Solution to Exercise 3.9.2 (p. 104)

19,969/100,450

Solution to Exercise 3.9.3 (p. 105)

4,715/100,450

Solution to Exercise 3.9.4 (p. 105)

36,636/100,450

Solution to Exercise 3.9.5 (p. 105)

4,715/15,273

Solutions to Practice 2: Calculating Probabilities 

Solution to Exercise 3.10.1 (p. 106) 

0.48 

Solution to Exercise 3.10.2 (p. 106) 

0.376 

Solution to Exercise 3.10.3 (p. 106) 

0.55 

Solution to Exercise 3.10.5 (p. 106) 

0.2068 

Solution to Exercise 3.10.7 (p. 106) 

No 

Solution to Exercise 3.10.8 (p. 106) 

0.6492 

Solution to Exercise 3.10.10 (p. 106) 

No 



Chapter 4 

Discrete Random Variables 

4.1 Discrete Random Variables 1 

4.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Recognize and understand discrete probability distribution functions, in general. 

• Calculate and interpret expected values. 

• Recognize the binomial probability distribution and apply it appropriately. 

• Recognize the Poisson probability distribution and apply it appropriately (optional). 

• Recognize the geometric probability distribution and apply it appropriately (optional). 

• Recognize the hypergeometric probability distribution and apply it appropriately (optional). 

• Classify discrete word problems by their distributions. 

4.1.2 Introduction 

A student takes a 10 question true-false quiz. Because the student had such a busy schedule, he or she 
could not study and randomly guesses at each answer. What is the probability of the student passing the 
test with at least a 70%? 

Small companies might be interested in the number of long distance phone calls their employees make 
during the peak time of the day. Suppose the average is 20 calls. What is the probability that the employees 
make more than 20 long distance phone calls during the peak time? 

These two examples illustrate two different types of probability problems involving discrete random
variables. Recall that discrete data are data that you can count. A random variable describes the outcomes
of a statistical experiment in words. The values of a random variable can vary with each repetition of an
experiment.

In this chapter, you will study probability problems involving discrete random distributions. You will also 
study long-term averages associated with them. 

4.1.3 Random Variable Notation 

Upper case letters like X or Y denote a random variable. Lower case letters like x or y denote the value of a 
random variable. If X is a random variable, then X is written in words, and x is given as a number. 



1 This content is available online at <http://cnx.org/content/m16825/1.14/>.
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For example, let X = the number of heads you get when you toss three fair coins. The sample space for the 
toss of three fair coins is TTT; THH; HTH; HHT; HTT; THT; TTH; HHH. Then, x = 0, 1, 2, 3. X is in 
words and x is a number. Notice that for this example, the x values are countable outcomes. Because you 
can count the possible values that X can take on and the outcomes are random (the x values 0, 1, 2, 3), X is 
a discrete random variable. 

4.1.4 Optional Collaborative Classroom Activity 

Toss a coin 10 times and record the number of heads. After all members of the class have completed the 
experiment (tossed a coin 10 times and counted the number of heads), fill in the chart using a heading like 
the one below. Let X = the number of heads in 10 tosses of the coin. 



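x      Frequency of x      Relative Frequency of x
____   ______________      _______________________
____   ______________      _______________________
____   ______________      _______________________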

Table 4.1 



Which value(s) of x occurred most frequently? 

If you tossed the coin 1,000 times, what values could x take on? Which value(s) of x do you think
would occur most frequently?

What does the relative frequency column sum to? 



4.2 Probability Distribution Function (PDF) for a Discrete Random 
Variable 2 



A discrete probability distribution function has two characteristics: 



• Each probability is between 0 and 1, inclusive.

• The sum of the probabilities is 1.



Example 4.1 

A child psychologist is interested in the number of times a newborn baby's crying wakes its mother 
after midnight. For a random sample of 50 mothers, the following information was obtained. Let 
X = the number of times a newborn wakes its mother after midnight. For this example, x = 0, 1, 2, 
3, 4, 5. 

P(x) = probability that X takes on a value x. 



2 This content is available online at <http://cnx.org/content/m16831/1.14/>.

x    P(x)
0    P(x=0) = 2/50
1    P(x=1) = 11/50
2    P(x=2) = 23/50
3    P(x=3) = 9/50
4    P(x=4) = 4/50
5    P(x=5) = 1/50

Table 4.2

X takes on the values 0, 1, 2, 3, 4, 5. This is a discrete PDF because

1. Each P(x) is between 0 and 1, inclusive.

2. The sum of the probabilities is 1, that is,

2/50 + 11/50 + 23/50 + 9/50 + 4/50 + 1/50 = 1 (4.1)
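The two characteristics of a discrete PDF are easy to test mechanically. The sketch below checks them for the distribution in Example 4.1, assuming the probabilities are the counts 2, 11, 23, 9, 4, 1 out of the 50 mothers sampled; the function name `is_discrete_pdf` is hypothetical.

```python
from fractions import Fraction

# P(x) for x = 0..5 from Example 4.1 (counts out of 50 mothers).
pdf = {x: Fraction(n, 50) for x, n in enumerate([2, 11, 23, 9, 4, 1])}


def is_discrete_pdf(pdf):
    """Check both characteristics: each P(x) in [0, 1] and the sum equal to 1."""
    each_in_range = all(0 <= p <= 1 for p in pdf.values())
    sums_to_one = sum(pdf.values()) == 1
    return each_in_range and sums_to_one


print(is_discrete_pdf(pdf))
```

Exact fractions are used so that the sum is compared with 1 without rounding error.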



Example 4.2 

Suppose Nancy has classes 3 days a week. She attends classes 3 days a week 80% of the time, 2 
days 15% of the time, 1 day 4% of the time, and no days 1% of the time. Suppose one week is 
randomly selected. 

Problem 1 (Solution on p. 138.)

Let X = the number of days Nancy ________.

Problem 2 (Solution on p. 138.)

X takes on what values?

Problem 3 (Solution on p. 138.) 

Suppose one week is randomly chosen. Construct a probability distribution table (called a PDF 
table) like the one in the previous example. The table should have two columns labeled x and P(x). 
What does the P(x) column sum to? 



4.3 Mean or Expected Value and Standard Deviation 3 

The expected value is often referred to as the "long-term" average or mean. This means that over the long
term of doing an experiment over and over, you would expect this average.

The mean of a random variable X is μ. If we do an experiment many times (for instance, flip a fair coin, as
Karl Pearson did, 24,000 times and let X = the number of heads) and record the value of X each time, the
average is likely to get closer and closer to μ as we keep repeating the experiment. This is known as the
Law of Large Numbers.
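The Law of Large Numbers can be illustrated with a short simulation. This is only a sketch: it assumes a pseudo-random generator with a fixed seed stands in for a fair coin, and it records X = 1 for heads and X = 0 for tails on each flip, so that μ = 0.5.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
n_flips = 24_000            # the number of flips mentioned for Karl Pearson
checkpoints = (10, 100, 1_000, 24_000)

heads = 0
running_average = {}
for i in range(1, n_flips + 1):
    if rng.random() < 0.5:  # simulated fair coin: heads with probability 0.5
        heads += 1
    if i in checkpoints:
        running_average[i] = heads / i

for i in checkpoints:
    print(i, running_average[i])
```

The averages at the early checkpoints can be far from 0.5, while the average after 24,000 flips is typically very close to it.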

NOTE: To find the expected value or long-term average, μ, simply multiply each value of the
random variable by its probability and add the products.



3 This content is available online at <http://cnx.org/content/m16828/1.16/>.
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A Step-by-Step Example 

A men's soccer team plays soccer 0, 1, or 2 days a week. The probability that they play 0 days is 0.2, the
probability that they play 1 day is 0.5, and the probability that they play 2 days is 0.3. Find the long-term
average, μ, or expected value of the days per week the men's soccer team plays soccer.

To do the problem, first let the random variable X = the number of days the men's soccer team plays soccer 
per week. X takes on the values 0, 1, 2. Construct a PDF table, adding a column xP (x). In this column, 
you will multiply each x value by its probability. 

Expected Value Table

x    P(x)   xP(x)
0    0.2    (0)(0.2) = 0
1    0.5    (1)(0.5) = 0.5
2    0.3    (2)(0.3) = 0.6

Table 4.4: This table is called an expected value table. The table helps you calculate the expected value or
long-term average.



Add the last column to find the long-term average or expected value:

(0)(0.2) + (1)(0.5) + (2)(0.3) = 0 + 0.5 + 0.6 = 1.1.

The expected value is 1.1. The men's soccer team would, on the average, expect to play soccer 1.1 days
per week. The number 1.1 is the long-term average or expected value if the men's soccer team plays soccer
week after week after week. We say μ = 1.1.
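The expected value table amounts to one sum of products. A minimal sketch, assuming the distribution is stored as a dictionary from x to P(x) (the function name `expected_value` is illustrative):

```python
def expected_value(pdf):
    """Multiply each value by its probability and add the products."""
    return sum(x * p for x, p in pdf.items())


# Days per week the men's soccer team plays, with their probabilities.
soccer = {0: 0.2, 1: 0.5, 2: 0.3}
mu = expected_value(soccer)

print(round(mu, 4))
```

The same function applies to any PDF table with columns x and P(x), for example the one for the number of days Nancy attends classes.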

Example 4.3 

Find the expected value for the example about the number of times a newborn baby's crying 
wakes its mother after midnight. The expected value is the expected number of times a newborn 
wakes its mother after midnight. 



x    P(x)             xP(x)
0    P(x=0) = 2/50    (0)(2/50) = 0
1    P(x=1) = 11/50   (1)(11/50) = 11/50
2    P(x=2) = 23/50   (2)(23/50) = 46/50
3    P(x=3) = 9/50    (3)(9/50) = 27/50
4    P(x=4) = 4/50    (4)(4/50) = 16/50
5    P(x=5) = 1/50    (5)(1/50) = 5/50

Table 4.5: You expect a newborn to wake its mother after midnight 2.1 times, on the average.

Add the last column to find the expected value. μ = Expected Value = 105/50 = 2.1

Problem 

Go back and calculate the expected value for the number of days Nancy attends classes a week. 
Construct the third column to do so. 

Solution 

2.74 days a week. 
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Example 4.4 

Suppose you play a game of chance in which five numbers are chosen from 0, 1, 2, 3, 4, 5, 6, 7, 8,
9. A computer randomly selects five numbers from 0 to 9 with replacement. You pay $2 to play
and could profit $100,000 if you match all 5 numbers in order (you get your $2 back plus $100,000).
Over the long term, what is your expected profit from playing the game?

To do this problem, set up an expected value table for the amount of money you can profit. 

Let X = the amount of money you profit. The values of x are not 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Since you 
are interested in your profit (or loss), the values of x are 100,000 dollars and -2 dollars. 

To win, you must get all 5 numbers correct, in order. The probability of choosing one correct
number is 1/10 because there are 10 numbers. You may choose a number more than once. The
probability of choosing all 5 numbers correctly and in order is:

(1/10)(1/10)(1/10)(1/10)(1/10) = 1 × 10^-5 = 0.00001 (4.2)

Therefore, the probability of winning is 0.00001 and the probability of losing is

1 - 0.00001 = 0.99999 (4.3)

The expected value table is as follows.





         x         P(x)      xP(x)
Loss     -2        0.99999   (-2)(0.99999) = -1.99998
Profit   100,000   0.00001   (100000)(0.00001) = 1

Table 4.6: Add the last column. -1.99998 + 1 = -0.99998

Since -0.99998 is about -1, you would, on the average, expect to lose approximately one dollar
for each game you play. However, each time you play, you either lose $2 or profit $100,000. The $1
is the average or expected LOSS per game after playing this game over and over.
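The expected profit in Example 4.4 can be verified with exact fractions, which avoids any rounding of the very small winning probability. This is a sketch; the variable names are assumptions.

```python
from fractions import Fraction

# Probability of matching all 5 digits in order, each chosen from 10 digits.
p_win = Fraction(1, 10) ** 5
p_lose = 1 - p_win

# Profit (in dollars) and its probability: win $100,000 or lose the $2 stake.
game = {100_000: p_win, -2: p_lose}

expected_profit = sum(x * p for x, p in game.items())

print(p_win, float(expected_profit))
```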

Example 4.5 

Suppose you play a game with a biased coin. You play each game by tossing the coin once.
P(heads) = 2/3 and P(tails) = 1/3. If you toss a head, you pay $6. If you toss a tail, you win $10.
If you play this game many times, will you come out ahead?

Problem 1 (Solution on p. 138.) 

Define a random variable X. 



Problem 2 (Solution on p. 138.)

Complete the following expected value table.

        x      ____   ____
WIN     10     1/3    ____
LOSE    ____   ____   -12/3

Table 4.7
3 
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Problem 3 

What is the expected value, μ? Do you come out ahead?



(Solution on p. 138.) 



Like data, probability distributions have standard deviations. To calculate the standard deviation (σ) of a
probability distribution, find each deviation from its expected value, square it, multiply it by its probability,
add the products, and take the square root. To understand how to do the calculation, look at the table for
the number of days per week a men's soccer team plays soccer. To find the standard deviation, add the
entries in the column labeled (x - μ)² · P(x) and take the square root.



x    P(x)   xP(x)            (x - μ)² P(x)
0    0.2    (0)(0.2) = 0     (0 - 1.1)² (0.2) = 0.242
1    0.5    (1)(0.5) = 0.5   (1 - 1.1)² (0.5) = 0.005
2    0.3    (2)(0.3) = 0.6   (2 - 1.1)² (0.3) = 0.243

Table 4.8

Add the last column in the table. 0.242 + 0.005 + 0.243 = 0.490. The standard deviation is the square root
of 0.49. σ = √0.49 = 0.7

Generally for probability distributions, we use a calculator or a computer to calculate μ and σ to reduce
roundoff error. For some probability distributions, there are short-cut formulas that calculate μ and σ.
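As an example of letting a computer do this calculation, the sketch below computes μ and σ for the soccer distribution directly from its PDF. The function name `mean_and_sd` is hypothetical.

```python
import math


def mean_and_sd(pdf):
    """Return (mu, sigma) for a discrete PDF given as {x: P(x)}."""
    mu = sum(x * p for x, p in pdf.items())
    # Square each deviation from mu, weight it by P(x), and add.
    variance = sum((x - mu) ** 2 * p for x, p in pdf.items())
    return mu, math.sqrt(variance)


mu, sigma = mean_and_sd({0: 0.2, 1: 0.5, 2: 0.3})

print(round(mu, 4), round(sigma, 4))
```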

4.4 Common Discrete Probability Distribution Functions 4 

Some of the more common discrete probability functions are binomial, geometric, hypergeometric, and
Poisson. Most elementary courses do not cover the geometric, hypergeometric, and Poisson. Your
instructor will let you know if he or she wishes to cover these distributions.

A probability distribution function is a pattern. You try to fit a probability problem into a pattern or
distribution in order to perform the necessary calculations. These distributions are tools to make solving
probability problems easier. Each distribution has its own special characteristics. Learning the characteristics
enables you to distinguish among the different distributions.

4.5 Binomial Distribution 5 



The characteristics of a binomial experiment are: 

1. There are a fixed number of trials. Think of trials as repetitions of an experiment. The letter n denotes
the number of trials.

2. There are only 2 possible outcomes, called "success" and "failure", for each trial. The letter p denotes
the probability of a success on one trial and q denotes the probability of a failure on one trial. p + q = 1.

3. The n trials are independent and are repeated using identical conditions. Because the n trials are
independent, the outcome of one trial does not help in predicting the outcome of another trial. Another
way of saying this is that for each individual trial, the probability, p, of a success and probability, q,
of a failure remain the same. For example, randomly guessing at a true-false statistics question has
only two outcomes. If a success is guessing correctly, then a failure is guessing incorrectly. Suppose
Joe always guesses correctly on any statistics true-false question with probability p = 0.6. Then,
q = 0.4. This means that for every true-false statistics question Joe answers, his probability of success
(p = 0.6) and his probability of failure (q = 0.4) remain the same.

4 This content is available online at <http://cnx.org/content/m16821/1.6/>.
5 This content is available online at <http://cnx.org/content/m16820/1.16/>.

The outcomes of a binomial experiment fit a binomial probability distribution. The random variable X = 
the number of successes obtained in the n independent trials. 

The mean, μ, and variance, σ², for the binomial probability distribution are μ = np and σ² = npq. The
standard deviation, σ, is then σ = √(npq).

Any experiment that has characteristics 2 and 3 and where n = 1 is called a Bernoulli Trial (named after 
Jacob Bernoulli who, in the late 1600s, studied them extensively). A binomial experiment takes place when 
the number of successes is counted in one or more Bernoulli Trials. 
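The text gives the mean and variance of a binomial distribution but not the probability of each count. The sketch below uses the standard binomial probability formula, C(n, k) p^k q^(n-k), which is an addition not stated in this section, and applies it to the 10-question true-false quiz from the chapter introduction (n = 10, p = 0.5, passing means at least 7 correct).

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), using the standard binomial formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5   # 10 true-false questions, each guessed at random
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the PDF, to compare with np and npq.
mean_from_pmf = sum(k * pk for k, pk in enumerate(pmf))
var_from_pmf = sum((k - mean_from_pmf) ** 2 * pk for k, pk in enumerate(pmf))

# Probability of passing with at least 70%, i.e. 7 or more correct.
p_pass = sum(pmf[7:])

print(mean_from_pmf, var_from_pmf, p_pass)
```

The mean and variance computed from the probabilities agree with the short-cut formulas μ = np and σ² = npq.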

Example 4.6 

At ABC College, the withdrawal rate from an elementary physics course is 30% for any given 
term. This implies that, for any given term, 70% of the students stay in the class for the entire 
term. A "success" could be defined as an individual who withdrew. The random variable is X = 
the number of students who withdraw from the randomly selected elementary physics class. 

Example 4.7 

Suppose you play a game that you can only either win or lose. The probability that you win any 
game is 55% and the probability that you lose is 45%. Each game you play is independent. If you 
play the game 20 times, what is the probability that you win 15 of the 20 games? Here, if you 
define X = the number of wins, then X takes on the values 0, 1, 2, 3, ..., 20. The probability of a 
success is p = 0.55. The probability of a failure is q = 0.45. The number of trials is n = 20. The 
probability question can be stated mathematically as P (x = 15). 

Example 4.8 

A fair coin is flipped 15 times. Each flip is independent. What is the probability of getting more 
than 10 heads? Let X = the number of heads in 15 flips of the fair coin. X takes on the values 0, 1, 
2, 3, ..., 15. Since the coin is fair, p = 0.5 and q = 0.5. The number of trials is n = 15. The probability 
question can be stated mathematically as P (x > 10). 

Example 4.9 

Approximately 70% of statistics students do their homework in time for it to be collected and 
graded. Each student does homework independently. In a statistics class of 50 students, what is 
the probability that at least 40 will do their homework on time? Students are selected randomly. 

Problem 1 (Solution on p. 138.) 

This is a binomial problem because there is only a success or a __________, there are a definite
number of trials, and the probability of a success is 0.70 for each trial.

Problem 2 (Solution on p. 138.) 

If we are interested in the number of students who do their homework, then how do we define 
X? 

Problem 3 (Solution on p. 138.) 

What values does x take on? 

Problem 4 (Solution on p. 138.) 

What is a "failure", in words? 

The probability of a success is p = 0.70. The number of trials is n = 50.

Problem 5 (Solution on p. 138.) 

If p + q = 1, then what is q? 
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Problem 6 (Solution on p. 138.) 
The words "at least" translate as what kind of inequality for the probability question P(x ____ 40)?

The probability question is P(x ≥ 40).



4.5.1 Notation for the Binomial: B = Binomial Probability Distribution Function 

X~B(n,p) 

Read this as "X is a random variable with a binomial distribution." The parameters are n and p: n = number
of trials, p = probability of a success on each trial.
Example 4.10 

It has been stated that about 41% of adult workers have a high school diploma but do not pursue 
any further education. If 20 adult workers are randomly selected, find the probability that at most 
12 of them have a high school diploma but do not pursue any further education. How many adult 
workers do you expect to have a high school diploma but do not pursue any further education? 

Let X = the number of workers who have a high school diploma but do not pursue any further 
education. 

X takes on the values 0, 1, 2, ..., 20 where n = 20 and p = 0.41. q = 1 - 0.41 = 0.59. X ~ B (20,0.41) 

Find P (x ≤ 12). P (x ≤ 12) = 0.9738. (calculator or computer)

Using the TI-83+ or the TI-84 calculators, the calculations are as follows. Go into 2nd DISTR. The
syntax for the instructions is as follows:

To calculate P (x = value): binompdf(n, p, number). If "number" is left out, the result is the binomial
probability table.

To calculate P (x ≤ value): binomcdf(n, p, number). If "number" is left out, the result is the
cumulative binomial probability table.

For this problem: After you are in 2nd DISTR, arrow down to A:binomcdf. Press ENTER. Enter
20,.41,12). The result is P (x ≤ 12) = 0.9738.

NOTE: If you want to find P (x = 12), use the pdf (0:binompdf). If you want to find P (x > 12),
use 1 - binomcdf(20,.41,12).

The probability that at most 12 workers have a high school diploma but do not pursue any further
education is 0.9738.

The graph of X ~ B (20, 0.41) is:

[Figure: bar graph of the distribution, with x = 0, 1, 2, ..., 20 on the horizontal axis and P(X = x) on the vertical axis.]

The y-axis contains the probability of x, where X = the number of workers who have only a high
school diploma.

The number of adult workers that you expect to have a high school diploma but not pursue any
further education is the mean, $\mu = np = (20)(0.41) = 8.2$.

The formula for the variance is $\sigma^2 = npq$.

The standard deviation is $\sigma = \sqrt{npq} = \sqrt{(20)(0.41)(0.59)} = 2.20$.
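
For readers working on a computer instead of a TI calculator, the numbers in Example 4.10 can be reproduced as follows. This is a sketch using only Python's standard library; binomial_pmf and binomial_cdf are illustrative names chosen to echo the calculator's binompdf and binomcdf, not functions from any particular package.

```python
from math import comb, sqrt

def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(n: int, p: float, k: int) -> float:
    """P(X <= k): the sum of the pmf from 0 through k."""
    return sum(binomial_pmf(n, p, i) for i in range(k + 1))

# Example 4.10: X ~ B(20, 0.41)
n, p = 20, 0.41
q = 1 - p
at_most_12 = binomial_cdf(n, p, 12)
mean = n * p
std_dev = sqrt(n * p * q)

print(f"P(x <= 12) = {at_most_12:.4f}")        # the text reports 0.9738
print(f"mean = {mean:.1f}, sd = {std_dev:.2f}")  # 8.2 and 2.20
```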

Example 4.11 

The following example illustrates a problem that is not binomial. It violates the condition of in- 
dependence. ABC College has a student advisory committee made up of 10 staff members and 
6 students. The committee wishes to choose a chairperson and a recorder. What is the proba- 
bility that the chairperson and recorder are both students? All names of the committee are put 
into a box and two names are drawn without replacement. The first name drawn determines the 
chairperson and the second name the recorder. There are two trials. However, the trials are not 
independent because the outcome of the first trial affects the outcome of the second trial. The 
probability of a student on the first draw is 6/16. The probability of a student on the second draw
is 5/15 when the first draw produces a student. The probability is 6/15 when the first draw produces
a staff member. The probability of drawing a student's name changes for each of the trials and, 
therefore, violates the condition of independence. 
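
The changing probabilities in Example 4.11 can be written out with exact fractions. This is a small sketch; the variable names are illustrative, and the final product applies the multiplication rule for dependent events to answer the example's question.

```python
from fractions import Fraction

students, staff = 6, 10
total = students + staff  # 16 names in the box

p_first_student = Fraction(students, total)                 # 6/16
p_second_given_student = Fraction(students - 1, total - 1)  # 5/15
p_second_given_staff = Fraction(students, total - 1)        # 6/15

# The second-draw probability depends on the first draw, so the
# trials are not independent and the experiment is not binomial.
p_both_students = p_first_student * p_second_given_student
print(p_both_students)  # 1/8
```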



4.6 Geometric Distribution 6 

The characteristics of a geometric experiment are: 

1. There are one or more Bernoulli trials with all failures except the last one, which is a success. In other 
words, you keep repeating what you are doing until the first success. Then you stop. For example, 
you throw a dart at a bull's eye until you hit the bull's eye. The first time you hit the bull's eye is a 
"success" so you stop throwing the dart. It might take you 6 tries until you hit the bull's eye. You can 
think of the trials as failure, failure, failure, failure, failure, success. STOP. 

2. In theory, the number of trials could go on forever. There must be at least one trial. 

3. The probability, p, of a success and the probability, q, of a failure are the same for each trial. p + q = 1
and q = 1 - p. For example, the probability of rolling a 3 when you throw one fair die is 1/6. This is
true no matter how many times you roll the die. Suppose you want to know the probability of getting
the first 3 on the fifth roll. On rolls 1, 2, 3, and 4, you do not get a face with a 3. The probability for
each of rolls 1, 2, 3, and 4 is q = 5/6, the probability of a failure. The probability of getting a 3 on the
fifth roll is (5/6)(5/6)(5/6)(5/6)(1/6) = 0.0804.

6 This content is available online at <http://cnx.Org/content/ml6822/l.16/>.

X = the number of independent trials until the first success. The mean and variance are in the summary in 
this chapter. 

Example 4.12 

You play a game of chance that you can either win or lose (there are no other possibilities) until 
you lose. Your probability of losing is p = 0.57. What is the probability that it takes 5 games until 
you lose? Let X = the number of games you play until you lose (includes the losing game). Then 
X takes on the values 1, 2, 3, ... (could go on indefinitely). The probability question is P (x = 5). 

Example 4.13 

A safety engineer feels that 35% of all industrial accidents in her plant are caused by failure of 
employees to follow instructions. She decides to look at the accident reports (selected randomly 
and replaced in the pile after reading) until she finds one that shows an accident caused by failure 
of employees to follow instructions. On the average, how many reports would the safety engineer 
expect to look at until she finds a report showing an accident caused by employee failure to follow 
instructions? What is the probability that the safety engineer will have to examine at least 3 reports 
until she finds a report showing an accident caused by employee failure to follow instructions? 

Let X = the number of accidents the safety engineer must examine until she finds a report showing 
an accident caused by employee failure to follow instructions. X takes on the values 1, 2, 3, .... The 
first question asks you to find the expected value or the mean. The second question asks you to 
find P (x ≥ 3). ("At least" translates as a "greater than or equal to" symbol.)
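
Both questions in Example 4.13 have short closed forms. The sketch below assumes the geometric model described above with p = 0.35; the variable names are illustrative. Needing at least 3 reports is the same event as the first two reports both being failures.

```python
# Example 4.13: X ~ G(0.35)
p = 0.35
q = 1 - p

# Expected number of reports examined until the first success.
expected_reports = 1 / p

# P(x >= 3): the first two reports are both failures.
prob_at_least_3 = q**2

# Cross-check with the complement: 1 - P(x = 1) - P(x = 2).
prob_check = 1 - (p + q * p)

print(f"expected reports = {expected_reports:.2f}")  # 2.86
print(f"P(x >= 3) = {prob_at_least_3:.4f}")          # 0.4225
```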

Example 4.14 

Suppose that you are looking for a student at your college who lives within five miles of you. You 
know that 55% of the 25,000 students do live within five miles of you. You randomly contact stu- 
dents from the college until one says he/she lives within five miles of you. What is the probability 
that you need to contact four people? 

This is a geometric problem because you may have a number of failures before you have the one 
success you desire. Also, the probability of a success stays the same each time you ask a student if 
he/she lives within five miles of you. There is no definite number of trials (number of times you 
ask a student). 

Problem 1 

Let X = the number of __________ you must ask __________ one says yes.

Solution 

Let X = the number of students you must ask until one says yes. 

Problem 2 (Solution on p. 138.) 

What values does X take on? 

Problem 3 (Solution on p. 138.) 

What are p and q? 

Problem 4 (Solution on p. 139.) 

The probability question is P( ). 

4.6.1 Notation for the Geometric: G = Geometric Probability Distribution Function

X~G(p) 

Read this as "X is a random variable with a geometric distribution." The parameter is p. p = the probability 
of a success for each trial. 

Example 4.15 

Assume that the probability of a defective computer component is 0.02. Components are randomly
selected. Find the probability that the first defect is caused by the 7th component tested. How
many components do you expect to test until one is found to be defective?

Let X = the number of computer components tested until the first defect is found. 

X takes on the values 1, 2, 3, ... where p = 0.02. X ~ G(0.02) 

Find P(x = 7). P (x = 7) = 0.0177. (calculator or computer) 

TI-83+ and TI-84: For a general discussion, see Example 4.10 (binomial). The syntax is similar.
The geometric parameter list is (p, number). If "number" is left out, the result is the geometric
probability table. For this problem: After you are in 2nd DISTR, arrow down to D:geometpdf.
Press ENTER. Enter .02,7). The result is P (x = 7) = 0.0177.

The probability that the 7th component is the first defect is 0.0177. 

The graph of X ~ G(0.02) is: 

[Figure: graph of the distribution, with x on the horizontal axis and P(X = x) on the vertical axis.]

The y-axis contains the probability of x, where X = the number of computer components tested.

The number of components that you would expect to test until you find the first defective one is
the mean, μ = 50.

The formula for the mean is $\mu = \frac{1}{p} = \frac{1}{0.02} = 50$.

The formula for the variance is $\sigma^2 = \frac{1}{p}\left(\frac{1}{p} - 1\right) = \frac{1}{0.02}\left(\frac{1}{0.02} - 1\right) = 2450$.

The standard deviation is $\sigma = \sqrt{\frac{1}{p}\left(\frac{1}{p} - 1\right)} = \sqrt{\frac{1}{0.02}\left(\frac{1}{0.02} - 1\right)} = 49.5$.
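
The results of Example 4.15 can be reproduced without a calculator. This is a minimal sketch using Python's standard library; geometric_pmf is an illustrative name standing in for the calculator's geometpdf.

```python
from math import sqrt

def geometric_pmf(p: float, k: int) -> float:
    """P(X = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

# Example 4.15: X ~ G(0.02)
p = 0.02
prob_seventh = geometric_pmf(p, 7)
mean = 1 / p
variance = (1 / p) * (1 / p - 1)
std_dev = sqrt(variance)

print(f"P(x = 7) = {prob_seventh:.4f}")  # 0.0177
print(f"mean = {mean:.0f}, variance = {variance:.0f}, sd = {std_dev:.1f}")  # 50, 2450, 49.5
```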

4.7 Hypergeometric Distribution (Optional) 7

The characteristics of a hypergeometric experiment are: 

1. You take samples from 2 groups. 

2. You are concerned with a group of interest, called the first group. 

3. You sample without replacement from the combined groups. For example, you want to choose a
softball team from a combined group of 11 men and 13 women. The team consists of 10 players.

4. Each pick is not independent, since sampling is without replacement. In the softball example, the
probability of picking a woman first is 13/24. The probability of picking a man second is 11/23 if a woman
was picked first. It is 10/23 if a man was picked first. The probability of the second pick depends on what
happened in the first pick.

5. You are not dealing with Bernoulli Trials. 

The outcomes of a hypergeometric experiment fit a hypergeometric probability distribution. The random 
variable X = the number of items from the group of interest. The mean and variance are given in the 
summary. 

Example 4.16 

A candy dish contains 100 jelly beans and 80 gumdrops. Fifty candies are picked at random. What 
is the probability that 35 of the 50 are gumdrops? The two groups are jelly beans and gumdrops. 
Since the probability question asks for the probability of picking gumdrops, the group of interest 
(first group) is gumdrops. The size of the group of interest (first group) is 80. The size of the 
second group is 100. The size of the sample is 50 (jelly beans or gumdrops). Let X = the number 
of gumdrops in the sample of 50. X takes on the values x = 0, 1, 2, ..., 50. The probability question
is P (x = 35).

Example 4.17 

Suppose a shipment of 100 VCRs is known to have 10 defective VCRs. An inspector randomly 
chooses 12 for inspection. He is interested in determining the probability that, among the 12, at 
most 2 are defective. The two groups are the 90 non-defective VCRs and the 10 defective VCRs. 
The group of interest (first group) is the defective group because the probability question asks for 
the probability of at most 2 defective VCRs. The size of the sample is 12 VCRs. (They may be 
non-defective or defective.) Let X = the number of defective VCRs in the sample of 12. X takes on 
the values 0, 1, 2, ..., 10. X may not take on the values 11 or 12. The sample size is 12, but there 
are only 10 defective VCRs. The inspector wants to know P (x ≤ 2) ("At most" means "less than or
equal to").
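
The inspector's question in Example 4.17 can be computed by counting combinations. This is a sketch, not the textbook's method: hypergeometric_pmf is a hypothetical helper, and its arguments follow the r (group of interest), b (second group), n (sample size) convention used in this chapter.

```python
from math import comb

def hypergeometric_pmf(r: int, b: int, n: int, k: int) -> float:
    """P(X = k): k items from the group of interest (size r) and
    n - k items from the second group (size b), drawn without replacement."""
    return comb(r, k) * comb(b, n - k) / comb(r + b, n)

# Example 4.17: 10 defective VCRs, 90 non-defective, sample of 12.
r, b, n = 10, 90, 12

# "At most 2 defective" means x = 0, 1, or 2.
prob_at_most_2 = sum(hypergeometric_pmf(r, b, n, k) for k in range(3))
print(f"P(x <= 2) = {prob_at_most_2:.4f}")
```

The sum stops at k = 2 because "at most" includes the endpoint; X itself can range from 0 to 10.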

Example 4.18 

You are president of an on-campus special events organization. You need a committee of 7 to plan 
a special birthday party for the president of the college. Your organization consists of 18 women 
and 15 men. You are interested in the number of men on your committee. If the members of the 
committee are randomly selected, what is the probability that your committee has more than 4 
men? 

This is a hypergeometric problem because you are choosing your committee from two groups 
(men and women). 

Problem 1 (Solution on p. 139.) 

Are you choosing with or without replacement? 

Problem 2 (Solution on p. 139.) 

What is the group of interest? 



7 This content is available online at <http://cnx.Org/content/ml6824/l.15/>. 






Problem 3 (Solution on p. 139.) 

How many are in the group of interest? 

Problem 4 (Solution on p. 139.) 

How many are in the other group? 

Problem 5 (Solution on p. 139.) 

Let X = __________ on the committee. What values does X take on?

Problem 6 (Solution on p. 139.) 

The probability question is P( ). 



4.7.1 Notation for the Hypergeometric: H = Hypergeometric Probability Distribution Function

X~H(r, b, n) 

Read this as "X is a random variable with a hypergeometric distribution." The parameters are r, b, and n. r 
= the size of the group of interest (first group), b = the size of the second group, n = the size of the chosen 
sample 

Example 4.19 

A school site committee is to be chosen randomly from 6 men and 5 women. If the committee
consists of 4 members chosen randomly, what is the probability that 2 of them are men? How
many men do you expect to be on the committee?

Let X = the number of men on the committee of 4. The men are the group of interest (first group).

X takes on the values 0, 1, 2, 3, 4, where r = 6, b = 5, and n = 4. X ~ H(6, 5, 4)

Find P (x = 2). P (x = 2) = 0.4545 (calculator or computer) 

NOTE: Currently, the TI-83+ and TI-84 do not have hypergeometric probability functions. There 
are a number of computer packages, including Microsoft Excel, that do. 

The probability that there are 2 men on the committee is about 0.45. 

The graph of X~H(6,5,4) is: 

[Figure: graph of the distribution, with x = 0, 1, 2, 3, 4 on the horizontal axis and P(X = x) on the vertical axis.]

The y-axis contains the probability of X, where X = the number of men on the committee.

You would expect μ = 2.18 (about 2) men on the committee.

The formula for the mean is $\mu = \frac{n \cdot r}{r + b} = \frac{4 \cdot 6}{6 + 5} = 2.18$.



The formula for the variance is fairly complex. You will find it in the Summary of the Discrete 
Probability Functions Chapter (Section 4.9). 
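
Because the TI calculators lack a hypergeometric function, a short computer check of Example 4.19 may be useful. This sketch uses only Python's standard library; hypergeometric_pmf is an illustrative helper name, not a function from a named package.

```python
from math import comb

def hypergeometric_pmf(r: int, b: int, n: int, k: int) -> float:
    """P(X = k) for X ~ H(r, b, n), counted with combinations."""
    return comb(r, k) * comb(b, n - k) / comb(r + b, n)

# Example 4.19: 6 men (group of interest), 5 women, committee of 4.
r, b, n = 6, 5, 4
prob_two_men = hypergeometric_pmf(r, b, n, 2)  # C(6,2) * C(5,2) / C(11,4)
mean = n * r / (r + b)

print(f"P(x = 2) = {prob_two_men:.4f}")  # 0.4545
print(f"mean = {mean:.2f}")              # 2.18
```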



4.8 Poisson Distribution (Optional) 8 

Characteristics of a Poisson experiment: 

1. The Poisson gives the probability of a number of events occurring in a fixed interval of time or space if 
these events happen with a known average rate and independently of the time since the last event. For 
example, a book editor might be interested in the number of words spelled incorrectly in a particular 
book. It might be that, on the average, there are 5 words spelled incorrectly in 100 pages. The interval 
is the 100 pages. 

2. The Poisson may be used to approximate the binomial if the probability of success is "small" (such
as 0.01) and the number of trials is "large" (such as 1000). You will verify the relationship in the
homework exercises. Here n is the number of trials and p is the probability of a "success."

The outcomes of a Poisson experiment fit a Poisson probability distribution. The random variable
X = the number of occurrences in the interval of interest. The mean and variance are given in the summary.
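
The approximation described in characteristic 2 can be seen numerically. The sketch below uses the values mentioned there (n = 1000, p = 0.01) and compares the two probability functions at a few values of x; the helper names and the chosen x values are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(n: int, p: float, k: int) -> float:
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(mu: float, k: int) -> float:
    """P(X = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

n, p = 1000, 0.01
mu = n * p  # the Poisson mean is set to the binomial mean np

for k in (5, 10, 15):
    print(k, round(binomial_pmf(n, p, k), 4), round(poisson_pmf(mu, k), 4))

# Largest pointwise gap over the values where most of the probability lies.
largest_gap = max(abs(binomial_pmf(n, p, k) - poisson_pmf(mu, k)) for k in range(31))
```

With a small p and a large n, the two columns printed above agree closely, which is the sense in which the Poisson "approximates" the binomial.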

Example 4.20 

The average number of loaves of bread put on a shelf in a bakery in a half-hour period is 12. Of 
interest is the number of loaves of bread put on the shelf in 5 minutes. The time interval of interest 
is 5 minutes. What is the probability that the number of loaves, selected randomly, put on the shelf 
in 5 minutes is 3? 

Let X = the number of loaves of bread put on the shelf in 5 minutes. If the average number of 
loaves put on the shelf in 30 minutes (half-hour) is 12, then the average number of loaves put on 
the shelf in 5 minutes is 



8 This content is available online at <http://cnx.Org/content/ml6829/l.16/>. 
(5/30) · 12 = 2 loaves of bread

The probability question asks you to find P (x = 3). 
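
The two steps in Example 4.20, rescaling the average to the interval of interest and then evaluating the probability, can be sketched as follows. The formula used for poisson_pmf is the standard Poisson probability function; the helper name itself is illustrative.

```python
from math import exp, factorial

def poisson_pmf(mu: float, k: int) -> float:
    """P(X = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

# Example 4.20: 12 loaves per 30 minutes, interval of interest = 5 minutes.
loaves_per_half_hour = 12
mu = loaves_per_half_hour * 5 / 30  # 2 loaves per 5 minutes

prob_exactly_3 = poisson_pmf(mu, 3)
print(f"P(x = 3) = {prob_exactly_3:.4f}")  # 0.1804
```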

Example 4.21 

A certain bank expects to receive 6 bad checks per day, on average. What is the probability of the 
bank getting fewer than 5 bad checks on any given day? Of interest is the number of checks the 
bank receives in 1 day, so the time interval of interest is 1 day. Let X = the number of bad checks 
the bank receives in one day. If the bank expects to receive 6 bad checks per day then the average 
is 6 checks per day. The probability question asks for P (x < 5).
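
For Example 4.21, "fewer than 5" is a strict inequality, so the sum runs over x = 0 through 4. A minimal sketch, with poisson_pmf as an illustrative helper built from the standard Poisson probability function:

```python
from math import exp, factorial

def poisson_pmf(mu: float, k: int) -> float:
    """P(X = k) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

# Example 4.21: an average of 6 bad checks per day.
mu = 6

# "Fewer than 5" means x = 0, 1, 2, 3, or 4 (5 itself is excluded).
prob_fewer_than_5 = sum(poisson_pmf(mu, k) for k in range(5))
print(f"P(x < 5) = {prob_fewer_than_5:.4f}")  # 0.2851
```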

Example 4.22 

You notice that a news reporter says "uh", on average, 2 times per broadcast. What is the
probability that the news reporter says "uh" more than 2 times per broadcast?

This is a Poisson problem because you are interested in knowing the number of times the news 
reporter says "uh" during a broadcast. 

Problem 1 (Solution on p. 139.) 

What is the interval of interest? 

Problem 2 (Solution on p. 139.) 

What is the average number of times the news reporter says "uh" during one broadcast? 

Problem 3 (Solution on p. 139.) 

Let X = __________. What values does X take on?

Problem 4 (Solution on p. 139.) 

The probability question is P( ). 



4.8.1 Notation for the Poisson: P = Poisson Probability Distribution Function 

X ~ P(μ)

Read this as "X is a random variable with a Poisson distribution." The parameter is μ (or λ). μ (or λ) = the
mean for the interval of interest.

Example 4.23 

Leah's answering machine receives about 6 telephone calls between 8 a.m. and 10 a.m. What is
the probability that Leah receives more than 1 call in the next 15 minutes?

Let X = the number of calls Leah receives in 15 minutes. (The interval of interest is 15 minutes or
1/4 hour.)

x = 0, 1, 2, 3, ...

If Leah receives, on the average, 6 telephone calls in 2 hours, and there are eight 15-minute
intervals in 2 hours, then Leah receives

(1/8) · 6 = 0.75

calls in 15 minutes, on the average. So, μ = 0.75 for this problem.

X ~ P(0.75)

Find P (x > 1). P (x > 1) = 0.1734 (calculator or computer)

TI-83+ and TI-84: For a general discussion, see Example 4.10 (binomial). The syntax is similar.
The Poisson parameter list is (μ for the interval of interest, number). For this problem:

Press 1- and then press 2nd DISTR. Arrow down to C:poissoncdf. Press ENTER. Enter .75,1).
The result is P (x > 1) = 0.1734. NOTE: The TI calculators use λ (lambda) for the mean.

The probability that Leah receives more than 1 telephone call in the next fifteen minutes is about
0.1734.

The graph of X ~ P(0.75) is:

[Figure: graph of the distribution, with x on the horizontal axis and P(X = x) on the vertical axis.]

The y-axis contains the probability of x where X = the number of calls in 15 minutes.
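
The calculation in Example 4.23 can be reproduced on a computer. This is a sketch using Python's standard library; poisson_cdf is an illustrative name that plays the role of the calculator's poissoncdf.

```python
from math import exp, factorial

def poisson_cdf(mu: float, k: int) -> float:
    """P(X <= k) for a Poisson random variable with mean mu."""
    return sum(exp(-mu) * mu**i / factorial(i) for i in range(k + 1))

# Example 4.23: 6 calls in 2 hours, and eight 15-minute intervals in 2 hours.
mu = 6 / 8  # 0.75 calls per 15 minutes

# "More than 1" is the complement of "at most 1".
prob_more_than_1 = 1 - poisson_cdf(mu, 1)
print(f"P(x > 1) = {prob_more_than_1:.4f}")  # 0.1734
```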

4.9 Summary of Functions 9



Formula 4.1: Binomial 
X~B(n,p) 

X = the number of successes in n independent trials 

n = the number of independent trials 

X takes on the values x = 0, 1, 2, 3, ..., n

p = the probability of a success for any trial

q = the probability of a failure for any trial

p + q = 1; q = 1 - p

The mean is $\mu = np$. The standard deviation is $\sigma = \sqrt{npq}$.

Formula 4.2: Geometric 
X~G(p) 

X = the number of independent trials until the first success (count the failures and the first success) 

X takes on the values x= 1, 2, 3, ... 

p = the probability of a success for any trial 

q = the probability of a failure for any trial 

p + q = 1

q = 1 - p

The mean is $\mu = \frac{1}{p}$.

The standard deviation is $\sigma = \sqrt{\frac{1}{p}\left(\frac{1}{p} - 1\right)}$.

Formula 4.3: Hypergeometric 

X~H(r,b,n) 

X = the number of items from the group of interest that are in the chosen sample. 

X may take on the values x = 0, 1, ..., up to the size of the group of interest. (The minimum value
for X may be larger than 0 in some instances.)

r = the size of the group of interest (first group) 

b = the size of the second group

n = the size of the chosen sample

n ≤ r + b

The mean is $\mu = \frac{nr}{r + b}$.

The standard deviation is $\sigma = \sqrt{\frac{rbn(r + b - n)}{(r + b)^2 (r + b - 1)}}$.

9 This content is available online at <http://cnx.Org/content/ml6833/l.10/>.

Formula 4.4: Poisson 
X ~ P(μ)

X = the number of occurrences in the interval of interest 

X takes on the values x = 0, 1, 2, 3, ... 

The mean μ is typically given. (λ is often used as the mean instead of μ.) When the Poisson is
used to approximate the binomial, we use the binomial mean μ = np, where n is the binomial number
of trials and p = the probability of a success for each trial. This approximation is valid when n is "large" and
p is "small" (a general rule is that n should be greater than or equal to 20 and p should be less than
or equal to 0.05). If n is large enough and p is small enough, then the Poisson approximates the
binomial very well. The variance is $\sigma^2 = \mu$ and the standard deviation is $\sigma = \sqrt{\mu}$.
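
The mean and standard deviation formulas in this summary can be collected into a few small functions. This is a sketch for checking work, not part of the original text; all function names are illustrative, and the sample calls reuse the parameters from Examples 4.10, 4.15, 4.19, and 4.23.

```python
from math import sqrt

def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Mean np and standard deviation sqrt(npq) for X ~ B(n, p)."""
    return n * p, sqrt(n * p * (1 - p))

def geometric_mean_sd(p: float) -> tuple[float, float]:
    """Mean 1/p and standard deviation sqrt((1/p)(1/p - 1)) for X ~ G(p)."""
    return 1 / p, sqrt((1 / p) * (1 / p - 1))

def hypergeometric_mean_sd(r: int, b: int, n: int) -> tuple[float, float]:
    """Mean nr/(r+b) and the matching standard deviation for X ~ H(r, b, n)."""
    total = r + b
    mean = n * r / total
    sd = sqrt(r * b * n * (total - n) / (total**2 * (total - 1)))
    return mean, sd

def poisson_mean_sd(mu: float) -> tuple[float, float]:
    """Mean mu and standard deviation sqrt(mu) for X ~ P(mu)."""
    return mu, sqrt(mu)

print(binomial_mean_sd(20, 0.41))       # about (8.2, 2.20)
print(geometric_mean_sd(0.02))          # about (50, 49.5)
print(hypergeometric_mean_sd(6, 5, 4))  # mean about 2.18
print(poisson_mean_sd(0.75))
```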

4.10 Practice 1: Discrete Distribution 10



4.10.1 Student Learning Outcomes 

• The student will analyze the properties of a discrete distribution. 

4.10.2 Given: 

A ballet instructor is interested in knowing what percent of each year's class will continue on to the next,
so that she can plan what classes to offer. Over the years, she has established the following probability 
distribution. 

• Let X = the number of years a student will study ballet with the teacher. 

• Let P (x) = the probability that a student will study ballet x years. 

4.10.3 Organize the Data 

Complete the table below using the data provided. 



x      P(x)     x*P(x)
1      0.10
2      0.05
3      0.10
4
5      0.30
6      0.20
7      0.10

Table 4.9



Exercise 4.10.1

In words, define the Random Variable X.

Exercise 4.10.2

P(x = 4)

Exercise 4.10.3

P(x < 4)

Exercise 4.10.4

On average, how many years would you expect a child to study ballet with this teacher?

4.10.4 Discussion Question 

Exercise 4.10.5 

What does the column "P(x)" sum to and why? 

Exercise 4.10.6 

What does the column "x * P(x)" sum to and why?



10 This content is available online at <http://cnx.Org/content/ml6830/l.14/>.

4.11 Practice 2: Binomial Distribution 11



4.11.1 Student Learning Outcomes 

• The student will construct the Binomial Distribution. 

4.11.2 Given 

The Higher Education Research Institute at UCLA collected data from 203,967 incoming first-time,
full-time freshmen from 270 four-year colleges and universities in the U.S. 71.3% of those students replied
that, yes, they believe that same-sex couples should have the right to legal marital status. (Source:
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf)

Suppose that you randomly pick 8 first-time, full-time freshmen from the survey. You are interested
in the number who believe that same-sex couples should have the right to legal marital status.



4.11.3 Interpret the Data 

Exercise 4.11.1 (Solution on p. 139.)

In words, define the random variable X.

Exercise 4.11.2 (Solution on p. 139.)

X ~

Exercise 4.11.3 (Solution on p. 139.)

What values does the random variable X take on?

Exercise 4.11.4

Construct the probability distribution function (PDF).

x      P(x)

Table 4.10



Exercise 4.11.5 (Solution on p. 139.)

On average (μ), how many would you expect to answer yes?

Exercise 4.11.6 (Solution on p. 139.)

What is the standard deviation (σ)?

Exercise 4.11.7 (Solution on p. 139.)

What is the probability that at most 5 of the freshmen reply "yes"?

Exercise 4.11.8 (Solution on p. 139.)

What is the probability that at least 2 of the freshmen reply "yes"?

Exercise 4.11.9

Construct a histogram or plot a line graph. Label the horizontal and vertical axes with words.
Include numerical scaling.

11 This content is available online at <http://cnx.Org/content/ml7107/l.18/>.

4.12 Practice 3: Poisson Distribution 12



4.12.1 Student Learning Outcomes 

• The student will analyze the properties of a Poisson distribution. 

4.12.2 Given 

On average, eight teens in the U.S. die from motor vehicle injuries per day. As
a result, states across the country are debating raising the driving age. (Source:
http://www.cdc.gov/Motorvehiclesafety/Teen_Drivers/teendrivers_factsheet.html)



4.12.3 Interpret the Data 

Exercise 4.12.1

Assume the event occurs independently in any given day. In words, define the Random Variable
X.

Exercise 4.12.2 (Solution on p. 139.)

X ~

Exercise 4.12.3 (Solution on p. 139.)

What values does X take on?

Exercise 4.12.4

For the given values of the random variable X, fill in the corresponding probabilities.

x      P(x)
0
4
8
10
11
15

Table 4.11

Exercise 4.12.5 (Solution on p. 139.) 

Is it likely that there will be no teens killed in the U.S. from motor vehicle injuries on any given 
day? Justify your answer numerically. 

Exercise 4.12.6 (Solution on p. 139.) 

Is it likely that there will be more than 20 teens killed in the U.S. from motor vehicle injuries on 
any given day? Justify your answer numerically. 



12 This content is available online at <http://cnx.Org/content/ml7109/l.15/>.

4.13 Practice 4: Geometric Distribution 13

4.13.1 Student Learning Outcomes

• The student will analyze the properties of a geometric distribution. 



4.13.2 Given: 

Use the information from the Binomial Distribution Practice (Section 4.11) shown below. 

The Higher Education Research Institute at UCLA collected data from 203,967 incoming first-time,
full-time freshmen from 270 four-year colleges and universities in the U.S. 71.3% of those students replied
that, yes, they believe that same-sex couples should have the right to legal marital status. (Source:
http://heri.ucla.edu/PDFs/pubs/TFS/Norms/Monographs/TheAmericanFreshman2011.pdf)

Suppose that you randomly select freshmen from the study until you find one who replies "yes."
You are interested in the number of freshmen you must ask.



4.13.3 Interpret the Data 

Exercise 4.13.1

In words, define the Random Variable X.

Exercise 4.13.2 (Solution on p. 140.)

X ~

Exercise 4.13.3 (Solution on p. 140.)

What values does the random variable X take on?

Exercise 4.13.4

Construct the probability distribution function (PDF). Stop at x = 6.

x      P(x)
1
2
3
4
5
6

Table 4.12

Exercise 4.13.5 (Solution on p. 140.)

On average (μ), how many freshmen would you expect to have to ask until you found one who
replies "yes"?

Exercise 4.13.6 (Solution on p. 140.) 

What is the probability that you will need to ask fewer than 3 freshmen? 



Exercise 4.13.7

Construct a histogram or plot a line graph. Label the horizontal and vertical axes with words.
Include numerical scaling.

13 This content is available online at <http://cnx.Org/content/ml7108/l.17/>.

4.14 Practice 5: Hypergeometric Distribution 14

4.14.1 Student Learning Outcomes

• The student will analyze the properties of a hypergeometric distribution. 



4.14.2 Given 

Suppose that a group of statistics students is divided into two groups: business majors and non-business 
majors. There are 16 business majors in the group and 7 non-business majors in the group. A random 
sample of 9 students is taken. We are interested in the number of business majors in the sample.

4.14.3 Interpret the Data 

Exercise 4.14.1

In words, define the Random Variable X.

Exercise 4.14.2 (Solution on p. 140.)

X ~

Exercise 4.14.3 (Solution on p. 140.)

What values does X take on?

Exercise 4.14.4

Construct the probability distribution function (PDF) for X.

x      P(x)

Table 4.13

Exercise 4.14.5 (Solution on p. 140.)

On average (μ), how many would you expect to be business majors?

14 This content is available online at <http://cnx.Org/content/ml7106/l.13/>.

4.15 Homework Link 15

Link to homework questions in Homework Collection/Book for Discrete Distributions Chapter 4 of
Collaborative Statistics for R. Bloom

Chapter 4 Homework Problems ( http://cnx.org/content/ml8927/latest/ ) 16



15 This content is available online at <http://cnx.Org/content/ml9045/l.l/>. 
16 http://cnx.org/content/ml8927/latest/ 

4.16 Review Questions Link 17

Link to Review Questions in Homework Collection/Book for Discrete Distributions Chapter 4 of
Collaborative Statistics for R. Bloom

Chapter 4 Review Problems ( http://cnx.org/content/ml9021/latest/ ) 18 



17 This content is available online at <http://cnx.Org/content/ml9033/l.l/>. 
18 http://cnx.org/content/ml9021/latest/ 

Solutions to Exercises in Chapter 4

Solution to Example 4.2, Problem 1 (p. 113) 

Let X = the number of days Nancy attends class per week. 

Solution to Example 4.2, Problem 2 (p. 113) 

0, 1, 2, and 3 

Solution to Example 4.2, Problem 3 (p. 113)

x      P(x)
0      0.01
1      0.04
2      0.15
3      0.80

Table 4.14

Solution to Example 4.5, Problem 1 (p. 115)

X = amount of profit

Solution to Example 4.5, Problem 2 (p. 115)

        x      P(x)    xP(x)
WIN     10     1/3     10/3
LOSE    -6     2/3     -12/3

Table 4.15

Solution to Example 4.5, Problem 3 (p. 116)

Add the last column of the table. The expected value μ = -2/3. You lose, on average, about 67 cents each
time you play the game so you do not come out ahead.

Solution to Example 4.9, Problem 1 (p. 117)

failure

Solution to Example 4.9, Problem 2 (p. 117)

X = the number of statistics students who do their homework on time

Solution to Example 4.9, Problem 3 (p. 117)

0, 1, 2, . . ., 50

Solution to Example 4.9, Problem 4 (p. 117)

Failure is a student who does not do his or her homework on time.

Solution to Example 4.9, Problem 5 (p. 117)

q = 0.30

Solution to Example 4.9, Problem 6 (p. 118)

greater than or equal to (≥)

Solution to Example 4.14, Problem 2 (p. 120)

1, 2, 3, . . ., (total number of students)

Solution to Example 4.14, Problem 3 (p. 120)

• p = 0.55

• q = 0.45

Solution to Example 4.14, Problem 4 (p. 120)
P(x = 4) 

Solution to Example 4.18, Problem 1 (p. 122) 
Without 

Solution to Example 4.18, Problem 2 (p. 122) 
The men 

Solution to Example 4.18, Problem 3 (p. 123) 
15 men 

Solution to Example 4.18, Problem 4 (p. 123) 
18 women 

Solution to Example 4.18, Problem 5 (p. 123) 
Let X = the number of men on the committee, x = 0, 1, 2, . . ., 7. 
Solution to Example 4.18, Problem 6 (p. 123) 
P(x>4) 

Solution to Example 4.22, Problem 1 (p. 125) 
One broadcast 

Solution to Example 4.22, Problem 2 (p. 125) 
2 

Solution to Example 4.22, Problem 3 (p. 125) 

Let X = the number of times the news reporter says "uh" during one broadcast. 
x = 0, 1, 2, 3, ... 

Solution to Example 4.22, Problem 4 (p. 125) 
P(x > 2) 

Solutions to Practice 2: Binomial Distribution 

Solution to Exercise 4.11.1 (p. 130) 
X= the number that reply "yes" 
Solution to Exercise 4.11.2 (p. 130) 
B(8, 0.713)

Solution to Exercise 4.11.3 (p. 130) 
0,1,2,3,4,5,6,7,8 

Solution to Exercise 4.11.5 (p. 130) 
5.7 

Solution to Exercise 4.11.6 (p. 130) 
1.28 

Solution to Exercise 4.11.7 (p. 131) 
0.4151 

Solution to Exercise 4.11.8 (p. 131) 
0.9990 

Solutions to Practice 3: Poisson Distribution 

Solution to Exercise 4.12.2 (p. 132) 
P(8) 

Solution to Exercise 4.12.3 (p. 132) 
0,1,2,3,4,... 

Solution to Exercise 4.12.5 (p. 132) 
No 

Solution to Exercise 4.12.6 (p. 132) 
No 

Solutions to Practice 4: Geometric Distribution

Solution to Exercise 4.13.2 (p. 133) 
G(0.713) 

Solution to Exercise 4.13.3 (p. 133) 
1,2,. . . 

Solution to Exercise 4.13.5 (p. 133) 
1.4 

Solution to Exercise 4.13.6 (p. 133) 
0.9176 

Solutions to Practice 5: Hypergeometric Distribution 

Solution to Exercise 4.14.2 (p. 135) 

H(16,7,9) 

Solution to Exercise 4.14.3 (p. 135) 

2,3,4,5,6,7,8,9 

Solution to Exercise 4.14.5 (p. 135) 

6.26 



Chapter 5 

Continuous Random Variables 



5.1 Introduction to Continuous Random Variables 1 

5.1.1 Student Learning Objectives 

By the end of this chapter, the student should be able to: 

• Recognize and understand continuous probability distribution functions in general. 

• Recognize the uniform probability distribution and apply it appropriately. 

• Recognize the exponential probability distribution and apply it appropriately. 

5.1.2 Introduction 

Continuous random variables have many applications. Baseball batting averages, IQ scores, the length of 
time a long distance telephone call lasts, weight, height, and temperature are just a few. Generally, for 
continuous random variables, the outcomes are measured, rather than counted. The field of reliability 
depends on a variety of continuous random variables. 

Note that the values of discrete and continuous random variables can sometimes be ambiguous. For exam- 
ple, if X is equal to the number of miles (to the nearest mile) you drive to work then X is a discrete random 
variable. You count the miles. If X is the distance you drive to work, then you measure values of X and X 
is a continuous random variable. How the random variable is defined is very important. 

This chapter gives an introduction to continuous random variables and continuous probability distribu- 
tions. There are many continuous probability distributions. We will be studying continuous distributions 
for several chapters and will use continuous probability throughout the rest of this course. We will start 
with the two simplest continuous distributions, the Uniform and the Exponential. 

5.2 Properties of Continuous Probability Distributions 2 

5.2.1 Properties of Continuous Probability Distributions 

The graph of a continuous probability distribution is a curve. Probability is represented by area under the 
curve. 



1 This content is available online at <http://cnx.org/content/m18866/1.1/>.
2 This content is available online at <http://cnx.org/content/m18860/1.1/>.

The curve is called the probability density function (abbreviated: pdf). We use the symbol f(x) to
represent the curve. f(x) is the function that corresponds to the graph; we use the density function f(x) to
draw the graph of the probability distribution.

Area under the curve is given by a different function called the cumulative distribution function (abbre- 
viated: cdf). The cumulative distribution function is used to evaluate probability as area. 

Note that the probability density function (pdf) and the cumulative distribution function (cdf) are not
the same, although they are related to each other.

• The entire area under the curve and above the x-axis is equal to 1 . 

• Probability is found for intervals of X values rather than for individual X values. 

• P (c < X < d) is the probability that the random variable X is in the interval between the values c and 
d. P (c < X < d) is the area under the curve, above the x-axis, to the right of c and the left of d. 

• P (X = c) = 0. The probability that X takes on any single individual value is 0. The area below the
curve, above the x-axis, and between X=c and X=c has no width, and therefore no area (area = 0).
Since the probability is equal to the area, the probability is also 0.

We will find the area that represents probability by using geometry, formulas, technology, or probability 
tables. In general, calculus is needed to find the area under the curve for many probability density functions. 
When we use formulas to find the area in this textbook, the formulas were found by using the techniques 
of integral calculus. However, because most students taking this course have not studied calculus, we will 
not be using calculus in this textbook. 
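
The idea that probability is area can be checked numerically without calculus. The sketch below is only
an illustration: it assumes a hypothetical density f(x) = x/2 for 0 ≤ x ≤ 2 (a triangle, not one of the
distributions studied in this chapter), and the function names are made up for this example. It adds up
the areas of many thin rectangles under the curve.

```python
def triangle_pdf(x):
    """Hypothetical density: f(x) = x/2 for 0 <= x <= 2, and 0 elsewhere."""
    return x / 2 if 0 <= x <= 2 else 0.0


def area_under_curve(pdf, left, right, slices=10_000):
    """Approximate the area under pdf between left and right (midpoint rule)."""
    width = (right - left) / slices
    total = 0.0
    for i in range(slices):
        midpoint = left + (i + 0.5) * width
        total += pdf(midpoint) * width  # area of one thin rectangle
    return total


total_area = area_under_curve(triangle_pdf, 0, 2)   # entire area under the curve
prob_0_to_1 = area_under_curve(triangle_pdf, 0, 1)  # P(0 < X < 1)
prob_single = area_under_curve(triangle_pdf, 1, 1)  # P(X = 1): no width, no area

print(round(total_area, 4), round(prob_0_to_1, 4), round(prob_single, 4))
```

Geometry gives the same answers for this triangle: the whole area is (1/2)(2)(1) = 1 and the area between
0 and 1 is (1/2)(1)(0.5) = 0.25.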

There are many continuous probability distributions. When using a continuous probability distribution to 
model probability, the distribution used is selected to best model and fit the particular situation. 

In this chapter and the next chapter, we will study the uniform distribution, the exponential distribution, 
and the normal distribution. The following graphs illustrate these distributions. 



[Graph: The Uniform Distribution. Shaded area represents P(3 < X < 6).]

Figure 5.1: The graph shows a Uniform Distribution with the area between X=3 and X=6 shaded to
represent the probability that the value of the random variable X is in the interval between 3 and 6.

[Graph: The Exponential Distribution. Shaded area represents P(2 < X < 4).]

Figure 5.2: The graph shows an Exponential Distribution with the area between X=2 and X=4 shaded to
represent the probability that the value of the random variable X is in the interval between 2 and 4.

[Graph: The Normal Distribution. Shaded area represents probability P(1 < X < 2).]

Figure 5.3: The graph shows the Standard Normal Distribution with the area between X=1 and X=2 shaded
to represent the probability that the value of the random variable X is in the interval between 1 and 2.



5.3 Continuous Probability Functions 3 

We begin by defining a continuous probability density function. We use the function notation f(x).
Intermediate algebra may have been your first formal introduction to functions. In the study of probability,
the functions we study are special. We define the function f(x) so that the area between it and the x-axis is
equal to a probability. Since the maximum probability is one, the maximum area is also one.

For continuous probability distributions, PROBABILITY = AREA. 



3 This content is available online at <http://cnx.org/content/m16805/1.9/>.

Example 5.1

Consider the function f(x) = 1/20 for 0 ≤ x ≤ 20. x = a real number. The graph of f(x) = 1/20 is a
horizontal line. However, since 0 ≤ x ≤ 20, f(x) is restricted to the portion between x = 0 and
x = 20, inclusive.

[Graph: f(x) = 1/20 for 0 ≤ x ≤ 20, a horizontal line segment at height 1/20 from x = 0 to x = 20.]

The graph of f(x) = 1/20 is a horizontal line segment when 0 ≤ x ≤ 20.

The area between f(x) = 1/20 where 0 ≤ x ≤ 20 and the x-axis is the area of a rectangle with base
= 20 and height = 1/20.

AREA = 20 · (1/20) = 1

This particular function, where we have restricted x so that the area between the function and 
the x-axis is 1, is an example of a continuous probability density function. It is used as a tool to 
calculate probabilities. 



Suppose we want to find the area between f(x) = 1/20 and the x-axis where 0 < x < 2.

[Graph: f(x) = 1/20 with the rectangle between x = 0 and x = 2 shaded.]

AREA = (2 − 0) · (1/20) = 0.1

(2 − 0) = 2 = base of a rectangle

1/20 = the height.

The area corresponds to a probability. The probability that x is between 0 and 2 is 0.1, which can
be written mathematically as P(0 < x < 2) = P(x < 2) = 0.1.



Suppose we want to find the area between f(x) = 1/20 and the x-axis where 4 < x < 15.

[Graph: f(x) = 1/20 with the rectangle between x = 4 and x = 15 shaded.]

AREA = (15 − 4) · (1/20) = 0.55

(15 − 4) = 11 = the base of a rectangle

1/20 = the height.

The area corresponds to the probability P(4 < x < 15) = 0.55.

Suppose we want to find P(x = 15). On an x-y graph, x = 15 is a vertical line. A vertical line has
no width (or 0 width). Therefore, P(x = 15) = (base)(height) = (0)(1/20) = 0.
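
The three rectangle calculations above can be sketched in a few lines of Python. This is a minimal
illustration, not part of the original example; the function name and its default arguments (a = 0,
b = 20) are assumptions chosen to match f(x) = 1/20.

```python
def uniform_interval_probability(c, d, a=0.0, b=20.0):
    """P(c < X < d) for the density f(x) = 1/(b - a) on a <= x <= b.

    The probability is the area of a rectangle: (base)(height).
    """
    height = 1.0 / (b - a)
    left = max(c, a)                 # the density is 0 outside [a, b]
    right = min(d, b)
    base = max(right - left, 0.0)
    return base * height


p_0_to_2 = uniform_interval_probability(0, 2)     # (2 - 0)(1/20)
p_4_to_15 = uniform_interval_probability(4, 15)   # (15 - 4)(1/20)
p_at_15 = uniform_interval_probability(15, 15)    # base = 0, so area = 0

print(round(p_0_to_2, 4), round(p_4_to_15, 4), round(p_at_15, 4))
```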



[Graph: f(x) = 1/20 on 0 ≤ x ≤ 20, with a vertical line at x = 15.]

P(X ≤ x) (can be written as P(X < x) for continuous distributions) is called the cumulative
distribution function or CDF. Notice the "less than or equal to" symbol. We can use the CDF to
calculate P(X > x). The CDF gives "area to the left" and P(X > x) gives "area to the right." We
calculate P(X > x) for continuous distributions as follows: P(X > x) = 1 − P(X < x).

[Graph: the area to the left of x is P(X < x); the area to the right of x is P(X > x) = 1 − P(X < x).]

Label the graph with f(x) and x. Scale the x and y axes with the maximum x and y values.
f(x) = 1/20, 0 ≤ x ≤ 20.

[Graph: f(x) = 1/20 with the area between x = 2.3 and x = 12.7 shaded.]

P(2.3 < x < 12.7) = (base)(height) = (12.7 − 2.3)(1/20) = 0.52
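
As a rough check on the "area to the left" and "area to the right" idea, the following Python sketch
writes the CDF of f(x) = 1/20 as a function. The name uniform_cdf and its default arguments are
assumptions made for this illustration only.

```python
def uniform_cdf(x, a=0.0, b=20.0):
    """Area to the left of x, P(X < x), for f(x) = 1/(b - a) on a <= x <= b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


area_left = uniform_cdf(12.7)                     # P(X < 12.7)
area_right = 1.0 - area_left                      # P(X > 12.7) = 1 - P(X < 12.7)
between = uniform_cdf(12.7) - uniform_cdf(2.3)    # P(2.3 < X < 12.7)

print(round(area_left, 4), round(area_right, 4), round(between, 4))
```

Subtracting two "areas to the left" gives the same 0.52 as the (base)(height) calculation.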
5.4 The Uniform Distribution (modified R. Bloom) 4

Example 5.2 

Illustrate the uniform distribution. The data that follows are a random sample of 55 smiling 

times, in seconds, of an eight-week old baby. 



10.4  19.6  18.8  13.9  17.8  16.8  21.6  17.9  12.5  11.1   4.9
12.8  14.8  22.8  20.0  15.9  16.3  13.4  17.1  14.5  19.0  22.8
 1.3   0.7   8.9  11.9  10.9   7.3   5.9   3.7  17.9  19.2   9.8
 5.8   6.9   2.6   5.8  21.7  11.8   3.4   2.1   4.5   6.3  10.7
 8.9   9.4   9.4   7.6  10.0   3.3   6.7   7.8  11.6  13.8  18.6

Table 5.1 

sample mean = 11.49 and sample standard deviation = 6.23 

We will assume that the sample of smiling times is drawn from a population that follows a uniform
distribution between 0 and 23 seconds. This means that any smiling time between 0 and 23 seconds
is equally likely.

Let X = length of time, in seconds, of an eight-week old's smile. 

The notation for the uniform distribution is 

X ~ U (a,b) where a = the lowest value and b = the highest value. 

For this example, X ~ U (0,23). 0 < X < 23.

Formulas for the theoretical mean and standard deviation are 



μ = (a + b)/2 and σ = √((b − a)² / 12)

For this problem, the theoretical mean and standard deviation are

μ = (0 + 23)/2 = 11.50 seconds and σ = √((23 − 0)² / 12) = 6.64 seconds



Notice that the theoretical mean and standard deviation are close to the sample mean and standard 
deviation. 
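
The two formulas can be evaluated with a short Python sketch. The function name uniform_mean_sd is an
assumption made for this illustration; only the formulas themselves come from the text.

```python
import math


def uniform_mean_sd(a, b):
    """Theoretical mean and standard deviation of X ~ U(a, b)."""
    mean = (a + b) / 2
    sd = math.sqrt((b - a) ** 2 / 12)
    return mean, sd


mean, sd = uniform_mean_sd(0, 23)   # the smiling-time example
print(round(mean, 2), round(sd, 2))
```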



To find f(X): f(X) = 1/(23 − 0) = 1/23

Example 5.3 
Problem 1 

What is the probability that a randomly chosen eight-week old smiles between 2 and 18 seconds? 

Solution 

Find P (2 < X < 18). 

P (2 < X < 18) = (base) (height) = (18 − 2) · (1/23) = 16/23.



4 This content is available online at <http://cnx.org/content/m18925/1.2/>.

Problem 2

Find the 90th percentile for an eight week old's smiling time. 

Solution 

Ninety percent of the smiling times fall below the 90th percentile, k, so P (X < k) = 0.90 

P(X < k) = 0.90
(base)(height) = 0.90
(k − 0) · (1/23) = 0.90
k = 23 · 0.90 = 20.7

[Graph: f(X) = 1/23 on 0 to 23, with the area to the left of k shaded. AREA = P(X < k) = 0.90]



Problem 3: Problem 3 is OPTIONAL: Conditional Probability 

Find the probability that a random eight week old smiles more than 12 seconds KNOWING that 
the baby smiles MORE THAN 8 SECONDS. 

Solution 

Find P(X > 12|X > 8). There are two ways to do the problem. For the first way, use the fact that
this is a conditional and changes the sample space. The graph illustrates the new sample space.
You already know the baby smiled more than 8 seconds.

Write a new f(X): f(X) = 1/(23 − 8) = 1/15

for 8 < X < 23

P(X > 12|X > 8) = (23 − 12) · (1/15) = 11/15

[Graph: the new sample space, f(X) = 1/15 for 8 < X < 23.]

For the second way, use the conditional formula from Chapter 3 with the original distribution X
~ U(0,23):

P(A|B) = P(A AND B) / P(B)

For this problem, A is (X > 12) and B is (X > 8).

So, P(X > 12|X > 8) = P(X > 12 AND X > 8) / P(X > 8) = P(X > 12) / P(X > 8) = (11/23) / (15/23) = 11/15 = 0.733
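
Both ways of finding the conditional probability can be compared with a small Python sketch. The helper
name uniform_right_tail is an assumption made for this illustration.

```python
def uniform_right_tail(x, a, b):
    """P(X > x) for X ~ U(a, b): (base)(height) = (b - x) * 1/(b - a)."""
    x = min(max(x, a), b)   # keep x inside the interval [a, b]
    return (b - x) / (b - a)


# First way: the condition X > 8 shrinks the sample space to 8 < X < 23.
first_way = uniform_right_tail(12, 8, 23)

# Second way: P(A|B) = P(A AND B) / P(B) with the original X ~ U(0, 23).
second_way = uniform_right_tail(12, 0, 23) / uniform_right_tail(8, 0, 23)

print(round(first_way, 3), round(second_way, 3))
```

Both lines evaluate 11/15, so the two approaches agree.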




Example 5.4 

Uniform: The amount of time, in minutes, that a person must wait for a bus is uniformly
distributed between 0 and 15 minutes.

Problem 1 

What is the probability that a person waits fewer than 12.5 minutes? 

Solution 

Let X = the number of minutes a person must wait for a bus. a = 0 and b = 15. X ~ U (0, 15).
Write the probability density function. f(X) = 1/(15 − 0) = 1/15 for 0 ≤ X ≤ 15

Find P (X < 12.5). Draw a graph.

P (X < 12.5) = (base) (height) = (12.5 − 0) · (1/15) = 0.8333

The probability a person waits less than 12.5 minutes is 0.8333.

[Graph: f(X) = 1/15 on 0 to 15, with the area between X = 0 and X = 12.5 shaded.]



Problem 2 

On the average, how long must a person wait? 



Find the mean, μ, and the standard deviation, σ.

Solution

μ = (a + b)/2 = (15 + 0)/2 = 7.5. On the average, a person must wait 7.5 minutes.

σ = √((b − a)² / 12) = √((15 − 0)² / 12) = 4.3. The standard deviation is 4.3 minutes.

Problem 3

Ninety percent of the time, the time a person must wait falls below what value? 

NOTE: This asks for the 90th percentile. 

Solution 

Find the 90th percentile. Draw a graph. Let k = the 90th percentile. 

P (X < k) = (base) (height) = (k − 0) · (1/15)

0.90 = k · (1/15)

k = (0.90) (15) = 13.5

k is sometimes called a critical value. 

The 90th percentile is 13.5 minutes. Ninety percent of the time, a person must wait at most 13.5 
minutes. 



[Graph: f(X) = 1/15 on 0 to 15, with the area to the left of k shaded. AREA = P(X < k) = 0.90]
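
Solving (k − a) · 1/(b − a) = p for k gives k = a + p(b − a). The Python sketch below applies that
rearrangement; the function name uniform_percentile is an assumption made for this illustration.

```python
def uniform_percentile(p, a, b):
    """Value k with P(X < k) = p for X ~ U(a, b)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return a + p * (b - a)


bus_90th = uniform_percentile(0.90, 0, 15)     # waiting time for a bus
smile_90th = uniform_percentile(0.90, 0, 23)   # smiling time of an eight-week old

print(round(bus_90th, 2), round(smile_90th, 2))
```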







Example 5.5 

Uniform: Ace Heating and Air Conditioning Service finds that the amount of time a repairman 
needs to fix a furnace is uniformly distributed between 1.5 and 4 hours. Let X = the time needed 
to fix a furnace. Then X ~ U (1.5,4). 

1. Find the probability that a randomly selected furnace repair requires more than 2 hours.

2. Find the probability that a randomly selected furnace repair requires less than 3 hours.

3. Find the 30th percentile of furnace repair times.

4. The longest 25% of furnace repairs take at least how long? (In other words: Find the
minimum time for the longest 25% of repair times.) What percentile does this represent?

5. Find the mean and standard deviation.

Problem 1 

Find the probability that a randomly selected furnace repair requires longer than 2 hours. 

Solution 

To find f(X): f(X) = 1/(4 − 1.5) = 1/2.5 so f(X) = 0.4

P(X > 2) = (base)(height) = (4 − 2)(0.4) = 0.8

[Graph: f(x) = 0.4 on 1.5 to 4, with the area P(X > 2) shaded.]

Figure 5.4: Uniform Distribution between 1.5 and 4 with shaded area between 2 and 4 representing the
probability that the repair time X is greater than 2



Problem 2 

Find the probability that a randomly selected furnace repair requires less than 3 hours. Describe 
how the graph differs from the graph in the first part of this example. 

Solution 

P (X < 3) = (base)(height) = (3 - 1.5)(0.4) = 0.6 

The graph of the rectangle showing the entire distribution would remain the same. However, the
graph should be shaded between X=1.5 and X=3. Note that the shaded area starts at X=1.5 rather
than at X=0; since X ~ U(1.5, 4), X cannot be less than 1.5.

[Graph: f(x) = 0.4 on 1.5 to 4, with the area P(X < 3) shaded.]

Figure 5.5: Uniform Distribution between 1.5 and 4 with shaded area between 1.5 and 3 representing the
probability that the repair time X is less than 3



Problem 3 

Find the 30th percentile of furnace repair times. 

Solution

[Graph: f(x) = 0.4 on 1.5 to 4, with an area of 0.3 shaded to the left of k. P(X < k) = 0.3]

Figure 5.6: Uniform Distribution between 1.5 and 4 with an area of 0.30 shaded to the left, representing the
shortest 30% of repair times.



P(X<k) =0.30 

P (X < k) = (base) (height) = (k - 1.5) • (0.4) 

0.3 = (k - 1.5) (0.4) ; Solve to find k: 

0.75 = k — 1.5 , obtained by dividing both sides by 0.4 

k = 2.25 , obtained by adding 1.5 to both sides 

The 30th percentile of repair times is 2.25 hours. 30% of repair times are 2.25 hours or less.



Problem 4 

The longest 25% of furnace repair times take at least how long? (Find the minimum time for 
the longest 25% of repairs.) 

Solution 



[Graph: f(x) = 0.4 on 1.5 to 4, with an area of 0.25 shaded to the right of k. P(X > k) = 0.25]

Figure 5.7: Uniform Distribution between 1.5 and 4 with an area of 0.25 shaded to the right representing
the longest 25% of repair times.



P(X > k) = 0.25

P (X > k) = (base) (height) = (4 − k) · (0.4)
0.25 = (4 − k)(0.4) ; Solve for k:
0.625 = 4 − k , obtained by dividing both sides by 0.4
−3.375 = −k , obtained by subtracting 4 from both sides
k = 3.375

The longest 25% of furnace repairs take at least 3.375 hours (3.375 hours or longer). 

Note: Since 25% of repair times are 3.375 hours or longer, that means that 75% of repair times are 
3.375 hours or less. 3.375 hours is the 75th percentile of furnace repair times. 
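
The same algebra can be written as a short Python sketch: for an area shaded to the right,
(b − k) · 1/(b − a) = tail area, so k = b − (tail area)(b − a). The function name is an assumption
made for this illustration.

```python
def uniform_upper_tail_cutoff(tail_area, a, b):
    """Value k with P(X > k) = tail_area for X ~ U(a, b)."""
    if not 0 <= tail_area <= 1:
        raise ValueError("tail_area must be between 0 and 1")
    return b - tail_area * (b - a)


k = uniform_upper_tail_cutoff(0.25, 1.5, 4)   # longest 25% of furnace repairs
percentile = 1 - 0.25                         # area to the left of k

print(k, percentile)
```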

Problem 5 

Find the mean and standard deviation.

Solution

μ = (a + b)/2 and σ = √((b − a)² / 12)

μ = (1.5 + 4)/2 = 2.75 hours and σ = √((4 − 1.5)² / 12) = 0.7217 hours



NOTE: See "Summary of the Uniform and Exponential Probability Distributions (Section 5.6)" for 
a full summary. 



5.5 The Exponential Distribution 5 

The exponential distribution is often concerned with the amount of time until some specific event occurs. 
For example, the amount of time (beginning now) until an earthquake occurs has an exponential distri- 
bution. Other examples include the length, in minutes, of long distance business telephone calls, and the 
amount of time, in months, a car battery lasts. It can be shown, too, that the value of the change that you 
have in your pocket or purse approximately follows an exponential distribution. 

Values for an exponential random variable occur in the following way. There are fewer large values and 
more small values. For example, the amount of money customers spend in one trip to the supermarket 
follows an exponential distribution. There are more people that spend less money and fewer people that 
spend large amounts of money. 

The exponential distribution is widely used in the field of reliability. Reliability deals with the amount of 
time a product lasts. 

Example 5.6 

Illustrates the exponential distribution: Let X = amount of time (in minutes) a postal clerk 
spends with his/her customer. The time is known to have an exponential distribution with the 
average amount of time equal to 4 minutes. 

X is a continuous random variable since time is measured. It is given that μ = 4 minutes. To do
any calculations, you must know m, the decay parameter.

m = 1/μ. Therefore, m = 1/4 = 0.25

5 This content is available online at <http://cnx.org/content/m16816/1.15/>.

The standard deviation, σ, is the same as the mean. μ = σ

The distribution notation is X ~ Exp(m). Therefore, X ~ Exp(0.25).

The probability density function is f(x) = m · e^(−m·x). The number e = 2.71828182846... It is a number
that is used often in mathematics. Scientific calculators have the key "e^x." If you enter 1 for x, the
calculator will display the value e.

The curve is:

f(x) = 0.25 · e^(−0.25·x) where x is at least 0 and m = 0.25.

For example, f(5) = 0.25 · e^(−0.25·5) = 0.072

The graph is as follows:

[Graph: the exponential density f(x) = 0.25 · e^(−0.25·x), a declining curve, with μ = 4 marked on the x-axis.]

Notice the graph is a declining curve. When x = 0,

f(x) = 0.25 · e^(−0.25·0) = 0.25 · 1 = 0.25 = m
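
For readers using a computer instead of a calculator, the density can be evaluated with a minimal Python
sketch. The function name exponential_pdf is an assumption made for this illustration.

```python
import math


def exponential_pdf(x, m):
    """Height of the exponential density f(x) = m * e^(-m*x) for x >= 0."""
    if x < 0:
        return 0.0          # the density is 0 for negative x
    return m * math.exp(-m * x)


m = 1 / 4                              # decay parameter m = 1/mu with mu = 4
height_at_5 = exponential_pdf(5, m)    # f(5)
height_at_0 = exponential_pdf(0, m)    # f(0) equals m

print(round(height_at_5, 3), height_at_0)
```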

Example 5.7 
Problem 1 

Find the probability that a clerk spends four to five minutes with a randomly selected customer. 

Solution 

Find P(4 < x < 5).

The cumulative distribution function (CDF) gives the area to the left.

P(X < x) = 1 − e^(−m·x)

P(x < 5) = 1 − e^(−0.25·5) = 0.7135 and P(x < 4) = 1 − e^(−0.25·4) = 0.6321

[Graph: exponential density with the area P(4 < x < 5) shaded.]

NOTE: You can do these calculations easily on a calculator.

The probability that a postal clerk spends four to five minutes with a randomly selected customer is

P(4 < x < 5) = P(x < 5) − P(x < 4) = 0.7135 − 0.6321 = 0.0814

NOTE: TI-83+ and TI-84: On the home screen, enter (1-e^(-.25*5))-(1-e^(-.25*4)) or enter
e^(-.25*4)-e^(-.25*5).
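
The same subtraction of two "areas to the left" can be sketched in Python. The function name
exponential_cdf is an assumption made for this illustration; the formula is the CDF stated above.

```python
import math


def exponential_cdf(x, m):
    """Area to the left of x, P(X < x) = 1 - e^(-m*x), for X ~ Exp(m)."""
    if x <= 0:
        return 0.0
    return 1 - math.exp(-m * x)


m = 0.25
left_of_5 = exponential_cdf(5, m)
left_of_4 = exponential_cdf(4, m)
between_4_and_5 = left_of_5 - left_of_4   # P(4 < x < 5)

print(round(left_of_5, 4), round(left_of_4, 4), round(between_4_and_5, 4))
```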



Problem 2 

Half of all customers are finished within how long? (Find the 50th percentile) 

Solution 

Find the 50th percentile. 



[Graph: exponential density with the area to the left of k shaded. P(x < k) = 0.50]

P (x < k) = 0.50, k = 2.8 minutes (calculator or computer)

Half of all customers are finished within 2.8 minutes.

You can also do the calculation as follows:

P (x < k) = 0.50 and P (x < k) = 1 − e^(−0.25·k)

Therefore, 0.50 = 1 − e^(−0.25·k) and e^(−0.25·k) = 1 − 0.50 = 0.5

Take natural logs: ln(e^(−0.25·k)) = ln(0.50). So, −0.25 · k = ln(0.50)

Solve for k: k = ln(0.50)/(−0.25) = 2.8 minutes

NOTE: A formula for the percentile k is k = LN(1 − AreaToTheLeft)/(−m) where LN is the natural log.

NOTE: TI-83+ and TI-84: On the home screen, enter LN(1-.50)/-.25. Press the (-) for the negative.
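
The percentile formula in the note translates directly into Python. This is a minimal sketch; the
function name exponential_percentile is an assumption made for this illustration.

```python
import math


def exponential_percentile(area_to_left, m):
    """Value k with P(X < k) = area_to_left for X ~ Exp(m)."""
    if not 0 <= area_to_left < 1:
        raise ValueError("area_to_left must be at least 0 and less than 1")
    return math.log(1 - area_to_left) / (-m)


median = exponential_percentile(0.50, 0.25)   # 50th percentile for the postal clerk
print(round(median, 1))
```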



Problem 3 

Which is larger, the mean or the median? 

Solution 

Is the mean or median larger? 

From Problem 2, the median or 50th percentile is 2.8 minutes. The theoretical mean is 4 minutes. The
mean is larger.



5.5.1 Optional Collaborative Classroom Activity 

Have each class member count the change he/she has in his/her pocket or purse. Your instructor will 
record the amounts in dollars and cents. Construct a histogram of the data taken by the class. Use 5 
intervals. Draw a smooth curve through the bars. The graph should look approximately exponential. Then 
calculate the mean. 

Let X = the amount of money a student in your class has in his/her pocket or purse. 

The distribution for X is approximately exponential with mean, μ = _______ and m = _______ . The standard
deviation, σ = _______ .

Draw the appropriate exponential graph. You should label the x and y axes, the decay rate, and the mean. 
Shade the area that represents the probability that one student has less than $.40 in his/her pocket or purse. 
(Shade P (x < 0.40)). 

Example 5.8 

On the average, a certain computer part lasts 10 years. The length of time the computer part lasts 
is exponentially distributed. 

Problem 1 

What is the probability that a computer part lasts more than 7 years? 

Solution 

Let x = the amount of time (in years) a computer part lasts. 

μ = 10 so m = 1/μ = 1/10 = 0.1

Find P (x > 7). Draw a graph.

P(x > 7) = 1 − P(x < 7).

Since P (X < x) = 1 − e^(−m·x) then P (X > x) = 1 − (1 − e^(−m·x)) = e^(−m·x)

P (x > 7) = e^(−0.1·7) = 0.4966. The probability that a computer part lasts more than 7 years is 0.4966.

NOTE: TI-83+ and TI-84: On the home screen, enter e^(-.1*7).

[Graph: exponential density with the area P(x > 7) shaded; μ = 10 is marked on the x-axis.]



Problem 2 

On the average, how long would 5 computer parts last if they are used one after another? 

Solution 

On the average, 1 computer part lasts 10 years. Therefore, 5 computer parts, if they are used one 
right after the other would last, on the average, 

(5) (10) = 50 years. 

Problem 3 

Eighty percent of computer parts last at most how long? 

Solution 

Find the 80th percentile. Draw a graph. Let k = the 80th percentile. 



[Graph: exponential density with the area to the left of k shaded. P(x < k) = 0.80]

Solve for k: k = ln(1 − 0.80)/(−0.1) = 16.1 years

Eighty percent of the computer parts last at most 16.1 years.

NOTE: TI-83+ and TI-84: On the home screen, enter LN(1 - .80)/-.1



Problem 4 

What is the probability that a computer part lasts between 9 and 11 years?

Solution

Find P (9 < X < 11). Draw a graph.

[Graph: exponential density with the area P(9 < x < 11) shaded.]

P (9 < x < 11) = P (x < 11) − P (x < 9) = (1 − e^(−0.1·11)) − (1 − e^(−0.1·9)) = 0.6671 − 0.5934 =
0.0737. (calculator or computer)

The probability that a computer part lasts between 9 and 11 years is 0.0737.
NOTE: TI-83+ and TI-84: On the home screen, enter e^(-.1*9) - e^(-.1*11).
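
The calculator entries for the computer part example can be mirrored in Python. This is a hedged sketch
using only the formulas stated in the example; the variable names are assumptions made for this
illustration.

```python
import math

mean_years = 10
m = 1 / mean_years                          # decay parameter m = 1/mu

p_more_than_7 = math.exp(-m * 7)            # P(x > 7) = e^(-m*7)
expected_total = 5 * mean_years             # five parts used one after another
k_80th = math.log(1 - 0.80) / (-m)          # 80th percentile
p_9_to_11 = math.exp(-m * 9) - math.exp(-m * 11)   # P(9 < x < 11)

print(round(p_more_than_7, 4), expected_total, round(k_80th, 1), round(p_9_to_11, 4))
```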



Example 5.9 

Suppose that the length of a phone call, in minutes, is an exponential random variable with decay
parameter = 1/12. If another person arrives at a public telephone just before you, find the probability
that you will have to wait more than 5 minutes. Let X = the length of a phone call, in minutes.

Problem (Solution on p. 166.)

What are m, μ, and σ? The probability that you must wait more than 5 minutes is _______ .



NOTE: A summary for exponential distribution is available in "Summary of The Uniform and
Exponential Probability Distributions (Section 5.6)".

5.6 Summary of the Uniform and Exponential Probability Distributions 6

Formula 5.1: Uniform

X = a real number between a and b (in some instances, X can take on the values a and b). a =
smallest X ; b = largest X

X ~ U (a,b)

The mean is μ = (a + b)/2

The standard deviation is σ = √((b − a)² / 12)

Probability density function: f(X) = 1/(b − a) for a ≤ X ≤ b
Area to the Left of x: P (X < x) = (base)(height)
Area to the Right of x: P (X > x) = (base)(height)

Area Between c and d: P (c < X < d) = (base) (height) = (d − c) (height).

Formula 5.2: Exponential

X ~ Exp (m)

X = a real number, 0 or larger. m = the parameter that controls the rate of decay or decline
The mean and standard deviation are the same.

μ = σ = 1/m and m = 1/μ = 1/σ

The probability density function: f(X) = m · e^(−m·X), X ≥ 0
Area to the Left of x: P (X < x) = 1 − e^(−m·x)
Area to the Right of x: P (X > x) = e^(−m·x)

Area Between c and d: P (c < X < d) = P (X < d) − P (X < c) = (1 − e^(−m·d)) − (1 − e^(−m·c))
= e^(−m·c) − e^(−m·d)

Percentile, k: k = LN(1 − AreaToTheLeft)/(−m)

6 This content is available online at <http://cnx.org/content/m16813/1.10/>.

5.7 Practice 1: Uniform Distribution (modified R. Bloom) 7
5.7.1 Student Learning Outcomes

• The student will explore the properties of data with a uniform distribution. 



5.7.2 Given 

The age of cars in the staff parking lot of a suburban college is uniformly distributed from six months (0.5 
years) to 9.5 years. 



5.7.3 Describe the Data 

Exercise 5.7.1 (Solution on p. 166.)

What is being measured here?

Exercise 5.7.2 (Solution on p. 166.)

In words, define the Random Variable X.

Exercise 5.7.3 (Solution on p. 166.)

Are the data discrete or continuous?

Exercise 5.7.4 (Solution on p. 166.)

The interval of values for X is:

Exercise 5.7.5 (Solution on p. 166.)

The distribution for X is:



5.7.4 Probability Distribution

Exercise 5.7.6 (Solution on p. 166.)

Write the probability density function.

Exercise 5.7.7 (Solution on p. 166.)

Graph the probability distribution.

a. Sketch the graph of the probability distribution.

Figure 5.8

7 This content is available online at <http://cnx.org/content/m18672/1.2/>.

b. Identify the following values:

i. Lowest value for X:
ii. Highest value for X:
iii. Height of the rectangle:
iv. Label for x-axis (words):
v. Label for y-axis (words):


5.7.5 Random Probability 

Exercise 5.7.8 (Solution on p. 166.) 

Find the probability that a randomly chosen car in the lot was less than 4 years old. 

a. Sketch the graph. Shade the area of interest. 



Figure 5.9 



b. Find the probability. P (X < 4) = 



5.7.6 Quartiles

Exercise 5.7.9 (Solution on p. 166.)

Find the average age of the cars in the lot.

Exercise 5.7.10 (Solution on p. 166.)

Find the third quartile of ages of cars in the lot. This means you will have to find the value such
that 3/4, or 75%, of the cars are at most (less than or equal to) that age.

a. Sketch the graph. Shade the area of interest.

Figure 5.10

b. Find the value k such that P (X < k) = 0.75.

c. The third quartile is:

5.8 Practice 2: Exponential Distribution 8
5.8.1 Student Learning Outcomes

• The student will analyze data following the exponential distribution.



5.8.2 Given 

Carbon-14 is a radioactive element with a half-life of about 5730 years. Carbon-14 is said to decay
exponentially. The decay rate is 0.000121. We start with 1 gram of carbon-14. We are interested in the
time (years) it takes to decay carbon-14.

5.8.3 Describe the Data 

Exercise 5.8.1 

What is being measured here? 

Exercise 5.8.2 (Solution on p. 166.) 

Are the data discrete or continuous? 

Exercise 5.8.3 (Solution on p. 166.) 

In words, define the Random Variable X. 

Exercise 5.8.4 (Solution on p. 166.) 

What is the decay rate (m)? 

Exercise 5.8.5 (Solution on p. 166.) 

The distribution for X is: 



5.8.4 Probability 

Exercise 5.8.6 (Solution on p. 166.) 

Find the amount (percent of 1 gram) of carbon-14 lasting less than 5730 years. This means, find 
P(x<5730). 

a. Sketch the graph. Shade the area of interest. 



Figure 5.11 



8 This content is available online at <http://cnx.org/content/m16811/1.11/>.

b. Find the probability. P (x < 5730) =

Exercise 5.8.7 (Solution on p. 167.)

Find the percentage of carbon-14 lasting longer than 10,000 years.

a. Sketch the graph. Shade the area of interest.

Figure 5.12

b. Find the probability. P (x > 10000) =

Exercise 5.8.8 (Solution on p. 167.)

Thirty percent (30%) of carbon-14 will decay within how many years?

a. Sketch the graph. Shade the area of interest.

Figure 5.13

b. Find the value k such that P (x < k) = 0.30.

5.9 Homework Link 9

Link to homework questions in Homework Collection/Book for Continuous Distributions Chapter 5 of
Collaborative Statistics for R. Bloom

Chapter 5 Homework Problems ( http://cnx.org/content/m16807/latest/ ) 10

9 This content is available online at <http://cnx.org/content/m19043/1.1/>.
10 http://cnx.org/content/m16807/latest/

5.10 Review Questions Link 11

Link to Review Questions in Homework Collection/Book for Continuous Distributions Chapter 5 of
Collaborative Statistics for R. Bloom

Chapter 5 Review Questions ( http://cnx.org/content/m19020/latest/ ) 12

11 This content is available online at <http://cnx.org/content/m19034/1.1/>.
12 http://cnx.org/content/m19020/latest/

Solutions to Exercises in Chapter 5

Solution to Example 5.9, Problem (p. 157) 

• m = 1/12

• μ = 12

• σ = 12

P(x > 5) = 0.6592

Solutions to Practice 1: Uniform Distribution (modified R. Bloom) 

Solution to Exercise 5.7.1 (p. 159) 

The age of cars in the staff parking lot 
Solution to Exercise 5.7.2 (p. 159) 
X = The age (in years) of cars in the staff parking lot 
Solution to Exercise 5.7.3 (p. 159) 
Continuous 

Solution to Exercise 5.7.4 (p. 159) 
0.5-9.5 

Solution to Exercise 5.7.5 (p. 159) 
X ~ U (0.5, 9.5)
Solution to Exercise 5.7.6 (p. 159)

f(x) = 1/9

Solution to Exercise 5.7.7 (p. 159)

b.i. 0.5
b.ii. 9.5
b.iii. 1/9
b.iv. Age of Cars
b.v. f(x)

Solution to Exercise 5.7.8 (p. 160)

b. 3.5/9

Solution to Exercise 5.7.9 (p. 160)

μ = 5

Solution to Exercise 5.7.10 (p. 160)

b. k = 7.25



Solutions to Practice 2: Exponential Distribution 

Solution to Exercise 5.8.2 (p. 162) 
Continuous 

Solution to Exercise 5.8.3 (p. 162) 
X = Time (years) to decay carbon-14 
Solution to Exercise 5.8.4 (p. 162) 
m = 0.000121 

Solution to Exercise 5.8.5 (p. 162) 
X ~ Exp(0.000121)
Solution to Exercise 5.8.6 (p. 162)

b. P (x < 5730) = 0.5001

Solution to Exercise 5.8.7 (p. 163)
b. P (x > 10000) = 0.2982
Solution to Exercise 5.8.8 (p. 163)
b. k = 2947.73

Chapter 6

The Normal Distribution 



6.1 The Normal Distribution 1 

6.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Recognize the normal probability distribution and apply it appropriately. 

• Recognize the standard normal probability distribution and apply it appropriately. 

• Compare normal probabilities by converting to the standard normal distribution. 

6.1.2 Introduction 

The normal, a continuous distribution, is the most important of all the distributions. It is widely used 
and even more widely abused. Its graph is bell-shaped. You see the bell curve in almost all disciplines. 
Some of these include psychology, business, economics, the sciences, nursing, and, of course, mathematics. 
Some of your instructors may use the normal distribution to help determine your grade. Most IQ scores are 
normally distributed. Often real estate prices fit a normal distribution. The normal distribution is extremely 
important but it cannot be applied to everything in the real world. 

In this chapter, you will study the normal distribution, the standard normal, and applications associated 
with them. 

6.1.3 Optional Collaborative Classroom Activity 

Your instructor will record the heights of both men and women in your class, separately. Draw histograms 
of your data. Then draw a smooth curve through each histogram. Is each curve somewhat bell-shaped? Do 
you think that if you had recorded 200 data values for men and 200 for women that the curves would look 
bell-shaped? Calculate the mean for each data set. Write the means on the x-axis of the appropriate graph 
below the peak. Shade the approximate area that represents the probability that one randomly chosen 
male is taller than 72 inches. Shade the approximate area that represents the probability that one randomly 
chosen female is shorter than 60 inches. If the total area under each curve is one, does either probability 
appear to be more than 0.5? 



1 This content is available online at <http://cnx.org/content/m16979/1.12/>.

The normal distribution has two parameters (two numerical descriptive measures), the mean (μ) and the
standard deviation (σ). If X is a quantity to be measured that has a normal distribution with mean (μ) and
the standard deviation (σ), we designate this by writing



NORMAL: X ~ N (μ, σ)




The probability density function is a rather complicated function. Do not memorize it. It is not necessary. 



$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}} $$



The cumulative distribution function is P (X < x) . It is calculated either by a calculator or a computer or 
it is looked up in a table. Technology has made the tables basically obsolete. For that reason, as well as 
the fact that there are various table formats, we are not including table instructions in this chapter. See the 
NOTE in this chapter in Calculation of Probabilities. 

The curve is symmetrical about a vertical line drawn through the mean, μ. In theory, the mean is the same
as the median since the graph is symmetric about μ. As the notation indicates, the normal distribution
depends only on the mean and the standard deviation. Since the area under the curve must equal one, a
change in the standard deviation, σ, causes a change in the shape of the curve; the curve becomes fatter or
skinnier depending on σ. A change in μ causes the graph to shift to the left or right. This means there are an
infinite number of normal probability distributions. One of special interest is called the standard normal 
distribution. 
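
The properties described above can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `normal_pdf` and `area_under_curve` and the choice of mean 5 with standard deviations 1 and 2 are illustrative assumptions. It evaluates the density, estimates the total area under the curve, and compares the peak height for two standard deviations.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, where sigma is the standard deviation."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def area_under_curve(mu, sigma, width=8.0, steps=20000):
    """Midpoint-rule estimate of the area from mu - width*sigma to mu + width*sigma."""
    lo = mu - width * sigma
    dx = 2.0 * width * sigma / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

# Illustrative (assumed) parameters: same mean, two different standard deviations.
peak_narrow = normal_pdf(5, 5, 1)    # height at the mean when sigma = 1
peak_wide = normal_pdf(5, 5, 2)      # larger sigma gives a lower, wider curve
area_narrow = area_under_curve(5, 1)
area_wide = area_under_curve(5, 2)

print(peak_narrow, peak_wide)
print(area_narrow, area_wide)
```

Both area estimates should come out very close to one, while the curve with the larger standard deviation has the lower peak. That is the "fatter or skinnier" behavior described in the paragraph above.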

6.2 The Standard Normal Distribution 2 

The standard normal distribution is a normal distribution of standardized values called z-scores. A
z-score is measured in units of the standard deviation. For example, if the mean of a normal distribution is
5 and the standard deviation is 2, the value 11 is 3 standard deviations above (or to the right of) the mean.
The calculation is:

$$ x = \mu + (z)\sigma = 5 + (3)(2) = 11 \tag{6.1} $$

The z-score is 3. 

The mean for the standard normal distribution is 0 and the standard deviation is 1. The transformation

$$ z = \frac{x - \mu}{\sigma} $$

produces the distribution Z ~ N (0, 1). The value x comes from a normal distribution with
mean μ and standard deviation σ.

2 This content is available online at <http://cnx.org/content/m16986/1.7/>.

6.3 Z-scores 3

If X is a normally distributed random variable and X ~ N (μ, σ), then the z-score is:

$$ z = \frac{x - \mu}{\sigma} \tag{6.2} $$

The z-score tells you how many standard deviations the value x is above (to the right of) or below
(to the left of) the mean, μ. Values of x that are larger than the mean have positive z-scores and values of x
that are smaller than the mean have negative z-scores. If x equals the mean, then x has a z-score of 0.

Example 6.1 

Suppose X ~ N (5, 6). This says that X is a normally distributed random variable with mean
μ = 5 and standard deviation σ = 6. Suppose x = 17. Then:

$$ z = \frac{x - \mu}{\sigma} = \frac{17 - 5}{6} = 2 \tag{6.3} $$

This means that x = 17 is 2 standard deviations (2σ) above or to the right of the mean μ = 5.
The standard deviation is σ = 6.

Notice that:

$$ 5 + 2 \cdot 6 = 17 \quad (\text{The pattern is } \mu + z\sigma = x.) \tag{6.4} $$

Now suppose x = 1. Then:

$$ z = \frac{x - \mu}{\sigma} = \frac{1 - 5}{6} = -0.67 \quad (\text{rounded to two decimal places}) \tag{6.5} $$

This means that x = 1 is 0.67 standard deviations (-0.67σ) below or to the left of the mean
μ = 5. Notice that:

5 + (-0.67) (6) is approximately equal to 1 (This has the pattern μ + (-0.67) σ = 1.)

Summarizing, when z is positive, x is above or to the right of μ and when z is negative, x is to the
left of or below μ.

Example 6.2 

Some doctors believe that a person can lose 5 pounds, on the average, in a month by reducing 
his/her fat intake and by exercising consistently. Suppose weight loss has a normal distribution. 
Let X = the amount of weight lost (in pounds) by a person in a month. Use a standard deviation 
of 2 pounds. X ~ N (5, 2). Fill in the blanks.

Problem 1 (Solution on p. 182.)

Suppose a person lost 10 pounds in a month. The z-score when x = 10 pounds is z = 2.5
(verify). This z-score tells you that x = 10 is ________ standard deviations to the ________ (right
or left) of the mean ________ (What is the mean?).

Problem 2 (Solution on p. 182.)

Suppose a person gained 3 pounds (a negative weight loss). Then z = ________. This z-score
tells you that x = -3 is ________ standard deviations to the ________ (right or left) of the mean.

Suppose the random variables X and Y have the following normal distributions: X ~ N (5, 6) and
Y ~ N (2, 1). If x = 17, then z = 2. (This was previously shown.) If y = 4, what is z?

$$ z = \frac{y - \mu}{\sigma} = \frac{4 - 2}{1} = 2 \quad \text{where } \mu = 2 \text{ and } \sigma = 1. \tag{6.6} $$



3 This content is available online at <http://cnx.org/content/m16991/1.9/>.

The z-score for y = 4 is z = 2. This means that y = 4 is 2 standard deviations to the right of
the mean. Therefore, x = 17 and y = 4 are both 2 (of their) standard deviations to the right of
their respective means.

The z-score allows us to compare data that are scaled differently. To understand the concept,
suppose X ~ N (5, 6) represents weight gains for one group of people who are trying to gain
weight in a 6 week period and Y ~ N (2, 1) measures the same weight gain for a second group
of people. A negative weight gain would be a weight loss. Since x = 17 and y = 4 are each 2
standard deviations to the right of their means, they represent the same weight gain relative to
their means.
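
The z-score arithmetic in Example 6.1 and in the comparison of X and Y can be reproduced in a few lines of Python. This is a minimal sketch, not part of the original text; the helper name `z_score` is an assumption made for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean mu."""
    return (x - mu) / sigma

# Example 6.1: X ~ N(5, 6)
z_17 = z_score(17, 5, 6)       # x = 17 is above the mean
z_1 = z_score(1, 5, 6)         # x = 1 is below the mean

# Comparison: Y ~ N(2, 1)
z_y4 = z_score(4, 2, 1)

# Going back from a z-score to a value uses the pattern mu + z * sigma = x
x_back = 5 + z_17 * 6

print(z_17, round(z_1, 2), z_y4, x_back)
```

Because `z_17` and `z_y4` are equal, x = 17 and y = 4 sit at the same relative position in their own distributions, even though the raw values are on different scales.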

The Empirical Rule 

If X is a random variable and has a normal distribution with mean μ and standard deviation σ then the
Empirical Rule says (See the figure below)

• About 68.27% of the x values lie between -1σ and +1σ of the mean μ (within 1 standard deviation of
the mean).

• About 95.45% of the x values lie between -2σ and +2σ of the mean μ (within 2 standard deviations of
the mean).

• About 99.73% of the x values lie between -3σ and +3σ of the mean μ (within 3 standard deviations of
the mean). Notice that almost all the x values lie within 3 standard deviations of the mean.

• The z-scores for +1σ and -1σ are +1 and -1, respectively.

• The z-scores for +2σ and -2σ are +2 and -2, respectively.

• The z-scores for +3σ and -3σ are +3 and -3, respectively.




[Figure: normal curve with the horizontal axis marked at -3σ, -2σ, -1σ, μ, 1σ, 2σ, 3σ]



Example 6.3 

Suppose X has a normal distribution with mean 50 and standard deviation 6. 

• About 68.27% of the x values lie between -1σ = (-1)(6) = -6 and 1σ = (1)(6) = 6 of the mean 50.
The values 50 - 6 = 44 and 50 + 6 = 56 are within 1 standard deviation of the mean 50. The
z-scores are -1 and +1 for 44 and 56, respectively.

• About 95.45% of the x values lie between -2σ = (-2)(6) = -12 and 2σ = (2)(6) = 12 of the mean 50.
The values 50 - 12 = 38 and 50 + 12 = 62 are within 2 standard deviations of the mean 50. The
z-scores are -2 and +2 for 38 and 62, respectively.

• About 99.73% of the x values lie between -3σ = (-3)(6) = -18 and 3σ = (3)(6) = 18 of the mean 50.
The values 50 - 18 = 32 and 50 + 18 = 68 are within 3 standard deviations of the mean 50. The
z-scores are -3 and +3 for 32 and 68, respectively.
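
The Empirical Rule percentages can be checked with Python's standard library. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later (for `statistics.NormalDist`), and the helper name `prob_within` is made up for illustration.

```python
from statistics import NormalDist

# Example 6.3: X is normal with mean 50 and standard deviation 6
X = NormalDist(mu=50, sigma=6)

def prob_within(dist, k):
    """Probability that a value falls within k standard deviations of the mean."""
    lo = dist.mean - k * dist.stdev
    hi = dist.mean + k * dist.stdev
    return dist.cdf(hi) - dist.cdf(lo)

# Probabilities for 1, 2, and 3 standard deviations, rounded to 4 places
within = {k: round(prob_within(X, k), 4) for k in (1, 2, 3)}
print(within)
```

The same three probabilities result for any mean and standard deviation, because the intervals are defined in units of σ.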
6.4 Areas to the Left and Right of x 4

The arrow in the graph below points to the area to the left of x. This area is represented by the probability 
P (X < x). Normal tables, computers, and calculators provide or calculate the probability P (X < x). 



[Figure: normal curve with the area to the left of x shaded and labeled P(X < x)]

The area to the right is then P (X > x) = 1 - P (X < x).

Remember, P (X < x) = Area to the left of the vertical line through x.

P (X > x) = 1 - P (X < x) = Area to the right of the vertical line through x.

P (X < x) is the same as P (X ≤ x) and P (X > x) is the same as P (X ≥ x) for continuous distributions.

6.5 Calculations of Probabilities 5 

Probabilities are calculated by using technology. There are instructions in the chapter for the TI-83+ and 
TI-84 calculators. 

NOTE: In the Table of Contents for Collaborative Statistics, entry 15. Tables has a link to a table 
of normal probabilities. Use the probability tables if so desired, instead of a calculator. The tables 
include instructions for how to use them.

Example 6.4 

If the area to the left is 0.0228, then the area to the right is 1 - 0.0228 = 0.9772. 

Example 6.5 

The final exam scores in a statistics class were normally distributed with a mean of 63 and a 
standard deviation of 5. 

Problem 1 

Find the probability that a randomly selected student scored more than 65 on the exam. 

Solution 

Let X = a score on the final exam. X ~ N (63, 5), where μ = 63 and σ = 5

Draw a graph. 

Then, find P (x > 65). 

P (x > 65) = 0.3446 (calculator or computer) 



4 This content is available online at <http://cnx.org/content/m16976/1.5/>.
5 This content is available online at <http://cnx.org/content/m16977/1.12/>.

[Figure: normal curve centered at 63 with the area to the right of 65 shaded; shaded area = 0.3446]

The probability that one student scores more than 65 is 0.3446.

Using the TI-83+ or the TI-84 calculators, the calculation is as follows. Go into 2nd DISTR. 

After pressing 2nd DISTR, press 2:normalcdf.

The syntax for the instruction is shown below.

normalcdf(lower value, upper value, mean, standard deviation)

For this problem: normalcdf(65,1E99,63,5) = 0.3446. You get 1E99 ( = 10^99 ) by pressing 1, the EE key
(a 2nd key) and then 99. Or, you can enter 10^99 instead. The number 10^99 is way out in the right tail
of the normal curve. We are calculating the area between 65 and 10^99. In some instances, the lower
number of the area might be -1E99 ( = -10^99 ). The number -10^99 is way out in the left tail of the
normal curve.

HISTORICAL NOTE: The TI probability program calculates a z-score and then the probability from 
the z-score. Before technology, the z-score was looked up in a standard normal probability table 
(because the math involved is too cumbersome) to find the probability. In this example, a standard 
normal table with area to the left of the z-score was used. You calculate the z-score and look up 
the area to the left. The probability is the area to the right. 



$$ z = \frac{65 - 63}{5} = 0.4 $$

Area to the left is 0.6554. P (x > 65) = P (z > 0.4) = 1 - 0.6554 = 0.3446
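
For readers working on a computer rather than a TI calculator, the same right-tail probability can be obtained in Python. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later, and the variable names are illustrative.

```python
from statistics import NormalDist

# Example 6.5, Problem 1: exam scores X ~ N(63, 5)
scores = NormalDist(mu=63, sigma=5)

p_left = scores.cdf(65)        # area to the left of 65, P(X < 65)
p_right = 1 - p_left           # area to the right of 65, P(X > 65)

# The "historical" route: convert to a z-score, then use the standard normal
z = (65 - 63) / 5
p_right_via_z = 1 - NormalDist().cdf(z)

print(round(p_left, 4), round(p_right, 4), z)
```

Here `cdf` plays the role of the table lookup (area to the left), and no stand-in for infinity such as 1E99 is needed because the right tail is computed as a complement.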



Problem 2 

Find the probability that a randomly selected student scored less than 85. 

Solution 

Draw a graph. 

Then find P (x < 85). Shade the graph. P (x < 85) = 1 (calculator or computer) 
The probability that one student scores less than 85 is approximately 1 (or 100%). 
The TI instructions and answer are as follows:
normalcdf (0,85,63,5) = 1 (rounds to 1) 



Problem 3 

Find the 90th percentile (that is, find the score k that has 90 % of the scores below k and 10% of 
the scores above k). 

Solution 

Find the 90th percentile. For each problem or part of a problem, draw a new graph. Draw the 
x-axis. Shade the area that corresponds to the 90th percentile. 

Let k = the 90th percentile. k is located on the x-axis. P (x < k) is the area to the left of k. The 90th
percentile k separates the exam scores into those that are the same or lower than k and those that
are the same or higher. Ninety percent of the test scores are the same or lower than k and 10% are
the same or higher. k is often called a critical value.

k = 69.4 (calculator or computer)

[Figure: normal curve with the area to the left of k shaded; P(x < k) = 0.90]

The 90th percentile is 69.4. This means that 90% of the test scores fall at or below 69.4 and 10% fall
at or above. For the TI-83+ or TI-84 calculators, use invNorm in 2nd DISTR.

invNorm(area to the left, mean, standard deviation)

For this problem, invNorm(0.90,63,5) = 69.4



Problem 4 

Find the 70th percentile (that is, find the score k such that 70% of scores are below k and 30% of 
the scores are above k). 

Solution 

Find the 70th percentile. 

Draw a new graph and label it appropriately. k = 65.6

The 70th percentile is 65.6. This means that 70% of the test scores fall at or below 65.6 and 30% fall
at or above.

invNorm(0.70,63,5) = 65.6 
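
The percentile calculations in Problems 3 and 4 have a direct counterpart in Python. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later, where `NormalDist.inv_cdf` takes the area to the left and returns the matching value, much like invNorm.

```python
from statistics import NormalDist

# Example 6.5: exam scores X ~ N(63, 5)
scores = NormalDist(mu=63, sigma=5)

k90 = scores.inv_cdf(0.90)     # 90th percentile: P(X < k90) = 0.90
k70 = scores.inv_cdf(0.70)     # 70th percentile: P(X < k70) = 0.70

# Round trip: the area to the left of each percentile recovers the input area
area_left_of_k90 = scores.cdf(k90)

print(round(k90, 1), round(k70, 1), round(area_left_of_k90, 2))
```

`inv_cdf` and `cdf` are inverses of each other, which is why feeding the percentile back into `cdf` returns the original area.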



Example 6.6 

A computer is used for office work at home, research, communication, personal finances, education,
entertainment, social networking and a myriad of other things. Suppose that the average
number of hours a household personal computer is used for entertainment is 2 hours per day.
Assume the times for entertainment are normally distributed and the standard deviation for the 
times is half an hour. 

Problem 1 

Find the probability that a household personal computer is used between 1.8 and 2.75 hours per 
day. 

Solution 

Let X = the amount of time (in hours) a household personal computer is used for entertainment. 
X ~ N (2, 0.5) where μ = 2 and σ = 0.5.

Find P (1.8 < x < 2.75). 

The probability for which you are looking is the area between x = 1.8 and x = 2.75.
P (1.8 < x < 2.75) = 0.5886

[Figure: normal curve centered at 2 with the area between 1.8 and 2.75 shaded]

normalcdf(1.8,2.75,2,0.5) = 0.5886 

The probability that a household personal computer is used between 1.8 and 2.75 hours per day 
for entertainment is 0.5886. 



Problem 2 

Find the maximum number of hours per day that the bottom quartile of households use a personal 

computer for entertainment. 

Solution 

To find the maximum number of hours per day that the bottom quartile of households uses a 
personal computer for entertainment, find the 25th percentile, k, where P (x < k) = 0.25. 



[Figure: normal curve with k = 1.66 marked; the area to the left of k is P(x < k) = 0.25 and the
area to the right of k is P(x > k) = 0.75]



invNorm(0.25,2,.5) = 1.66 

The maximum number of hours per day that the bottom quartile of households uses a personal 
computer for entertainment is 1.66 hours. 
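
Both parts of Example 6.6 can be verified in Python. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later, and the variable names are illustrative.

```python
from statistics import NormalDist

# Example 6.6: hours of entertainment use per day, X ~ N(2, 0.5)
hours = NormalDist(mu=2, sigma=0.5)

# Problem 1: area between two values is the difference of two left areas
p_between = hours.cdf(2.75) - hours.cdf(1.8)

# Problem 2: the top of the bottom quartile is the 25th percentile
k25 = hours.inv_cdf(0.25)

print(round(p_between, 4), round(k25, 2))
```

The "between" probability is computed as P(X < 2.75) - P(X < 1.8), which is what normalcdf(1.8,2.75,2,0.5) does in a single call.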
6.6 Summary of Formulas 6

Formula 6.1: Normal Probability Distribution
X ~ N (μ, σ)

μ = the mean; σ = the standard deviation

Formula 6.2: Standard Normal Probability Distribution

Z ~ N (0, 1)

z = a standardized value (z-score)

mean = 0; standard deviation = 1

Formula 6.3: Finding the kth Percentile

To find the kth percentile when the z-score is known: k = μ + (z) σ

Formula 6.4: z-score

$$ z = \frac{x - \mu}{\sigma} $$

Formula 6.5: Finding the area to the left 
The area to the left: P (X < x) 

Formula 6.6: Finding the area to the right 

The area to the right: P (X > x) = 1 - P (X < x) 



6 This content is available online at <http://cnx.org/content/m16987/1.5/>.

6.7 Practice: The Normal Distribution 7

6.7.1 Student Learning Outcomes 

• The student will analyze data following a normal distribution. 

6.7.2 Given 

The life of Sunshine CD players is normally distributed with a mean of 4.1 years and a standard deviation 
of 1.3 years. A CD player is guaranteed for 3 years. We are interested in the length of time a CD player 
lasts. 

6.7.3 Normal Distribution 

Exercise 6.7.1 

Define the Random Variable X in words. X = 

Exercise 6.7.2 
X~ 

Exercise 6.7.3 (Solution on p. 182.) 

Find the probability that a CD player will break down during the guarantee period. 

a. Sketch the situation. Label and scale the axes. Shade the region corresponding to the probability.

Figure 6.1

b. P (0 < x < ________) = ________ (Use zero (0) for the minimum value of x.)

Exercise 6.7.4 (Solution on p. 182.) 

Find the probability that a CD player will last between 2.8 and 6 years. 

a. Sketch the situation. Label and scale the axes. Shade the region corresponding to the probability.



7 This content is available online at <http://cnx.org/content/m16983/1.10/>.

Figure 6.2



b. P (________ < x < ________) = ________



Exercise 6.7.5 (Solution on p. 182.)

Find the 70th percentile of the distribution for the time a CD player lasts.

a. Sketch the situation. Label and scale the axes. Shade the region corresponding to the lower 70%.

Figure 6.3

b. P (x < k) = ________. Therefore, k = ________.

6.8 Homework Link 8

Link to homework questions in Homework Collection/Book for Normal Distribution Chapter 6 of
Collaborative Statistics for R. Bloom

Chapter 6 Homework Problems ( http://cnx.org/content/m16978/latest/ ) 9



8 This content is available online at <http://cnx.org/content/m19050/1.1/>.
9 http://cnx.org/content/m16978/latest/

6.9 Review Questions Link 10

Link to Review Questions in Homework Collection/Book for Normal Distribution Chapter 6 of
Collaborative Statistics for R. Bloom

Chapter 6 Review Questions ( http://cnx.org/content/m19027/latest/ ) 11

10 This content is available online at <http://cnx.org/content/m19035/1.1/>.
11 http://cnx.org/content/m19027/latest/

Solutions to Exercises in Chapter 6

Solution to Example 6.2, Problem 1 (p. 171) 

This z-score tells you that x = 10 is 2.5 standard deviations to the right of the mean 5. 

Solution to Example 6.2, Problem 2 (p. 171) 

z = -4. This z-score tells you that x = -3 is 4 standard deviations to the left of the mean.

Solutions to Practice: The Normal Distribution

Solution to Exercise 6.7.3 (p. 178)

b. 3, 0.1979

Solution to Exercise 6.7.4 (p. 178)

b. 2.8, 6, 0.7694

Solution to Exercise 6.7.5 (p. 179)

b. 0.70, 4.78 years



Chapter 7 

The Central Limit Theorem 

7.1 The Central Limit Theorem 1 

7.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Recognize the Central Limit Theorem problems. 

• Classify continuous word problems by their distributions. 

• Apply and interpret the Central Limit Theorem for Means. 

• Apply and interpret the Central Limit Theorem for Sums. 

7.1.2 Introduction 

Why are we so concerned with means? Two reasons are that they give us a middle ground for comparison 
and they are easy to calculate. In this chapter, you will study means and the Central Limit Theorem. 

The Central Limit Theorem (CLT for short) is one of the most powerful and useful ideas in all of statistics.
There are two alternative forms of the theorem, and both alternatives are concerned with drawing finite
samples of size n from a population with a known mean, μ, and a known standard deviation, σ. The first
alternative says that if we collect samples of size n and n is "large enough," calculate each sample's mean,
and create a histogram of those means, then the resulting histogram will tend to have an approximate
normal bell shape. The second alternative says that if we again collect samples of size n that are "large
enough," calculate the sum of each sample and create a histogram, then the resulting histogram will again
tend to have a normal bell shape.

In either case, it does not matter what the distribution of the original population is, or whether you even 
need to know it. The important fact is that the sample means and the sums tend to follow the normal 
distribution. And, the rest you will learn in this chapter. 

The size of the sample, n, that is required in order to be "large enough" depends on the original
population from which the samples are drawn. If the original population is far from normal then more 
observations are needed for the sample means or the sample sums to be normal. Sampling is done with 
replacement. 

Optional Collaborative Classroom Activity 



1 This content is available online at <http://cnx.org/content/m16953/1.17/>.

Do the following example in class: Suppose 8 of you roll 1 fair die 10 times, 7 of you roll 2 fair dice 10
times, 9 of you roll 5 fair dice 10 times, and 11 of you roll 10 fair dice 10 times.

Each time a person rolls more than one die, he/she calculates the sample mean of the faces showing. For
example, one person might roll 5 fair dice and get a 2, 2, 3, 4, 6 on one roll.

The mean is (2 + 2 + 3 + 4 + 6) / 5 = 3.4. The 3.4 is one mean when 5 fair dice are rolled. This same
person would roll the 5 dice 9 more times and calculate 9 more means for a total of 10 means.

Your instructor will pass out the dice to several people as described above. Roll your dice 10 times. For 
each roll, record the faces and find the mean. Round to the nearest 0.5. 

Your instructor (and possibly you) will produce one graph (it might be a histogram) for 1 die, one graph for 
2 dice, one graph for 5 dice, and one graph for 10 dice. Since the "mean" when you roll one die, is just the 
face on the die, what distribution do these means appear to be representing? 

Draw the graph for the means using 2 dice. Do the sample means show any kind of pattern? 

Draw the graph for the means using 5 dice. Do you see any pattern emerging? 

Finally, draw the graph for the means using 10 dice. Do you see any pattern to the graph? What can you 
conclude as you increase the number of dice? 

As the number of dice rolled increases from 1 to 2 to 5 to 10, the following is happening: 

1. The mean of the sample means remains approximately the same. 

2. The spread of the sample means (the standard deviation of the sample means) gets smaller. 

3. The graph appears steeper and thinner. 

You have just demonstrated the Central Limit Theorem (CLT). 

The Central Limit Theorem tells you that as you increase the number of dice, the sample means tend 
toward a normal distribution (the sampling distribution). 
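
If dice are not available, the classroom activity can be imitated by simulation. The following is a minimal Python sketch, not part of the original text; the function name `sample_means`, the fixed seed, and the choice of 5000 repetitions per group (far more than the 10 rolls per student in the activity) are assumptions made so that the pattern is easy to see.

```python
import random
import statistics

def sample_means(num_dice, repetitions, rng):
    """Roll num_dice fair dice `repetitions` times; return the mean face value of each roll."""
    means = []
    for _ in range(repetitions):
        faces = [rng.randint(1, 6) for _ in range(num_dice)]
        means.append(sum(faces) / num_dice)
    return means

rng = random.Random(0)   # fixed seed (an arbitrary choice) so the run is repeatable

# For each group size, record (mean of the sample means, standard deviation of the sample means)
summary = {}
for num_dice in (1, 2, 5, 10):
    means = sample_means(num_dice, 5000, rng)
    summary[num_dice] = (statistics.mean(means), statistics.stdev(means))

for num_dice, (center, spread) in summary.items():
    print(num_dice, round(center, 2), round(spread, 2))
```

In a run of this sketch, the center stays near 3.5 for every group while the spread shrinks as the number of dice grows, matching items 1 and 2 in the list above. Plotting a histogram of each list of means would show the shape becoming more bell-like.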

7.2 The Central Limit Theorem for Sample Means (Averages) 2 

Suppose X is a random variable with a distribution that may be known or unknown (it can be any
distribution). Using a subscript that matches the random variable, suppose:

a. μ_X = the mean of X

b. σ_X = the standard deviation of X

If you draw random samples of size n, then as n increases, the random variable X̄, which consists of sample
means, tends to be normally distributed and

$$ \bar{X} \sim N\left( \mu_X, \frac{\sigma_X}{\sqrt{n}} \right) $$

The Central Limit Theorem for Sample Means says that if you keep drawing larger and larger samples
(like rolling 1, 2, 5, and, finally, 10 dice) and calculating their means, the sample means form their own
normal distribution (the sampling distribution). The normal distribution has the same mean as the
original distribution and a variance that equals the original variance divided by n, the sample size. n is the
number of values that are averaged together, not the number of times the experiment is done.



2 This content is available online at <http://cnx.org/content/m16947/1.23/>.

To put it more formally, if you draw random samples of size n, the distribution of the random variable
X̄, which consists of sample means, is called the sampling distribution of the mean. The sampling
distribution of the mean approaches a normal distribution as n, the sample size, increases.

The random variable X̄ has a different z-score associated with it than the random variable X. x̄ is the value
of X̄ in one sample.

$$ z = \frac{\bar{x} - \mu_X}{\left( \frac{\sigma_X}{\sqrt{n}} \right)} \tag{7.1} $$

μ_X is both the average of X and of X̄.

σ_X̄ = σ_X / √n = standard deviation of X̄ and is called the standard error of the mean.



Example 7.1 

An unknown distribution has a mean of 90 and a standard deviation of 15. Samples of size n = 25 
are drawn randomly from the population. 

Problem 1 

Find the probability that the sample mean is between 85 and 92. 

Solution 

Let X = one value from the original unknown population. The probability question asks you to 
find a probability for the sample mean. 

Let X̄ = the mean of a sample of size 25. Since μ_X = 90, σ_X = 15, and n = 25;

then X̄ ~ N (90, 15/√25)

Find P (85 < x̄ < 92). Draw a graph.

P (85 < x̄ < 92) = 0.6997

The probability that the sample mean is between 85 and 92 is 0.6997.

[Figure: normal curve centered at 90 with the area between 85 and 92 shaded; P(85 < x̄ < 92)]

TI-83 or 84: normalcdf(lower value, upper value, mean, standard error of the mean)
The parameter list is abbreviated (lower value, upper value, μ, σ/√n)

normalcdf(85, 92, 90, 15/√25) = 0.6997

Problem 2

Find the value that is 2 standard deviations above the expected value (it is 90) of the sample mean. 

Solution 

To find the value that is 2 standard deviations above the expected value 90, use the formula 



$$ \text{value} = \mu_X + (\text{\#ofSTDEVs}) \left( \frac{\sigma_X}{\sqrt{n}} \right) $$

$$ \text{value} = 90 + 2 \cdot \frac{15}{\sqrt{25}} = 96 $$



So, the value that is 2 standard deviations above the expected value is 96. 
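
Both parts of Example 7.1 can be reproduced in Python. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later, and the variable names are illustrative. The key step is that the sample mean uses the standard error σ_X/√n, not σ_X itself, as its standard deviation.

```python
import math
from statistics import NormalDist

# Example 7.1: population mean 90, population standard deviation 15, samples of size 25
mu_x, sigma_x, n = 90, 15, 25

standard_error = sigma_x / math.sqrt(n)          # 15 / 5 = 3
xbar = NormalDist(mu=mu_x, sigma=standard_error)  # approximate distribution of the sample mean

# Problem 1: probability that the sample mean is between 85 and 92
p_between = xbar.cdf(92) - xbar.cdf(85)

# Problem 2: the value 2 standard errors above the expected value
value_2_sd_above = mu_x + 2 * standard_error

print(standard_error, round(p_between, 4), value_2_sd_above)
```

Using 15 instead of 3 as the last argument would answer a different question: the probability for one individual value, and only if the population itself were normal.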



Example 7.2 

The length of time, in hours, it takes an "over 40" group of people to play one soccer match is 
normally distributed with a mean of 2 hours and a standard deviation of 0.5 hours. A sample of 
size n = 50 is drawn randomly from the population. 

Problem 

Find the probability that the sample mean is between 1.8 hours and 2.3 hours. 

Solution 

Let X = the time, in hours, it takes to play one soccer match. 

The probability question asks you to find a probability for the sample mean time, in hours, it 
takes to play one soccer match. 

Let X̄ = the mean time, in hours, it takes to play one soccer match.

If μ_X = ________, σ_X = ________, and n = ________, then X̄ ~ N (________, ________)

by the Central Limit Theorem for Means.

μ_X = 2, σ_X = 0.5, n = 50, and X̄ ~ N (2, 0.5/√50)

Find P (1.8 < x̄ < 2.3). Draw a graph.

P (1.8 < x̄ < 2.3) = 0.9977

normalcdf(1.8, 2.3, 2, 0.5/√50) = 0.9977

The probability that the mean time is between 1.8 hours and 2.3 hours is ________.

7.3 The Central Limit Theorem for Sums (Optional) 3

Suppose X is a random variable with a distribution that may be known or unknown (it can be any
distribution) and suppose:

a. μ_X = the mean of X

b. σ_X = the standard deviation of X

If you draw random samples of size n, then as n increases, the random variable ΣX, which consists of sums,
tends to be normally distributed and

$$ \Sigma X \sim N\left( n \cdot \mu_X, \sqrt{n} \cdot \sigma_X \right) $$

The Central Limit Theorem for Sums says that if you keep drawing larger and larger samples and taking
their sums, the sums form their own distribution (the sampling distribution), which approaches a
normal distribution as the sample size increases. The normal distribution has a mean equal to the original
mean multiplied by the sample size and a standard deviation equal to the original standard deviation
multiplied by the square root of the sample size.

The random variable ΣX has the following z-score associated with it:

a. Σx is one sum.

b. $ z = \frac{\Sigma x - n \cdot \mu_X}{\sqrt{n} \cdot \sigma_X} $

a. n · μ_X = the mean of ΣX

b. √n · σ_X = standard deviation of ΣX

Example 7.3 

An unknown distribution has a mean of 90 and a standard deviation of 15. A sample of size 80 is 
drawn randomly from the population. 

Problem 

a. Find the probability that the sum of the 80 values (or the total of the 80 values) is more than 

7500. 

b. Find the sum that is 1.5 standard deviations above the mean of the sums. 

Solution 

Let X = one value from the original unknown population. The probability question asks you to 
find a probability for the sum (or total of) 80 values. 

ΣX = the sum or total of 80 values. Since μ_X = 90, σ_X = 15, and n = 80, then
ΣX ~ N (80 · 90, √80 · 15)

• mean of the sums = n · μ_X = (80) (90) = 7200

• standard deviation of the sums = √n · σ_X = √80 · 15

• sum of 80 values = Σx = 7500

a: Find P (Σx > 7500)



3 This content is available online at <http://cnx.org/content/m16948/1.16/>.

P (Σx > 7500) = 0.0127

[Figure: normal curve centered at 7200 with the area to the right of 7500 shaded]

normalcdf(lower value, upper value, mean of sums, stdev of sums)
The parameter list is abbreviated (lower, upper, n · μ_X, √n · σ_X)
normalcdf(7500, 1E99, 80 · 90, √80 · 15) = 0.0127
Reminder: 1E99 = 10^99. Press the EE key for E.

b: Find Σx where z = 1.5:

Σx = n · μ_X + z · √n · σ_X = (80)(90) + (1.5)(√80)(15) = 7401.2
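
The calculations for sums in Example 7.3 can be checked in Python. This is a minimal sketch, not part of the original text; it assumes Python 3.8 or later, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Example 7.3: population mean 90, population standard deviation 15, sample size 80
mu_x, sigma_x, n = 90, 15, 80

mean_of_sums = n * mu_x                  # n * mu_X
sd_of_sums = math.sqrt(n) * sigma_x      # sqrt(n) * sigma_X

sums = NormalDist(mu=mean_of_sums, sigma=sd_of_sums)  # approximate distribution of the sum

# Part a: probability that the sum of the 80 values is more than 7500
p_more_than_7500 = 1 - sums.cdf(7500)

# Part b: the sum that is 1.5 standard deviations above the mean of the sums
sum_at_z = mean_of_sums + 1.5 * sd_of_sums

print(mean_of_sums, round(p_more_than_7500, 4), round(sum_at_z, 1))
```

Note that the standard deviation of the sums grows with √n, whereas the standard deviation of the sample means shrinks with √n.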



7.4 Using the Central Limit Theorem (modified R. Bloom) 4 

It is important to understand when to use the CLT. Use the CLT for means or averages when you are asked 
to find the probability for a sample average or mean, or when working with percentiles for sample averages. 
(If you are being asked to find the probability or percentile of a sum or total, use the CLT for sums.) 

NOTE: If you are being asked to find the probability of an individual value, do not use the CLT. 
Use the distribution of its random variable. 



7.4.1 Law of Large Numbers 

The Law of Large Numbers says that if you take samples of larger and larger size from any population, then
the mean x̄ of the sample gets closer and closer to μ. From the Central Limit Theorem, we know that as n
gets larger and larger, the sample averages follow a normal distribution. The larger n gets, the smaller the
standard deviation gets. (Remember that the standard deviation for X̄ is σ/√n.) This means that the sample
mean x̄ must be close to the population mean μ. We can say that μ is the value that the sample averages
approach as n gets larger. The Central Limit Theorem illustrates the Law of Large Numbers.

Example 7.4 

A study involving stress is done on a college campus among the students. The stress scores follow 
a continuous uniform distribution with the lowest stress score equal to 1 and the highest equal 
to 5. Using a sample of 75 students, find: 



4 This content is available online at <http://cnx.org/content/m18864/1.4/>.

a. The probability that the average stress score for the 75 students is less than 2.

b. The 90th percentile for the average stress score for the 75 students. 

Let X = the stress score for one individual student 

The individual stress scores follow a continuous uniform distribution, X ~ U (1, 5) where a = 1
and b = 5 (See the chapter on Continuous Random Variables 5 ).

$$ \mu_X = \frac{a + b}{2} = \frac{1 + 5}{2} = 3 $$

$$ \sigma_X = \sqrt{\frac{(b - a)^2}{12}} = \sqrt{\frac{(5 - 1)^2}{12}} = 1.15 $$

Problems a and b ask you to find a probability or a percentile for an average or mean. The sample
size, n, is equal to 75.

Let X̄ = the average stress score for the 75 students.

For the average stress score, use the CLT which tells us that X̄ ~ N (μ_X, σ_X/√n)

X̄ ~ N (3, 1.15/√75) where n = 75.

Problem 1 

Find P (x̄ < 2). Draw the graph.

Solution

P (x̄ < 2) = 0

The probability that the average stress score is less than 2 is about 0.

[Figure: normal curve centered at 3 with the area to the left of 2 shaded; P(x̄ < 2)]

normalcdf(1, 2, 3, 1.15/√75) = 0

NOTE: The smallest stress score is 1. Therefore, the smallest average for 75 stress scores is 1. 



Problem 2 

Find the 90th percentile for the sample average of 75 stress scores. Draw a graph. 

Solution 

Let k = the 90th percentile. Find k where P (x̄ < k) = 0.90.

5 "Continuous Random Variables: Introduction" <http://cnx.org/content/m16808/latest/>

k = 3.17 using invNorm(0.90, 3, 1.15/√75) = 3.17

[Figure: normal curve with the area to the left of k shaded; shaded area = 0.90]



The 90th percentile for the sample average of 75 scores is about 3.17. This means that 90% of all
the averages of samples of 75 stress scores are at most 3.17 and 10% of the sample averages are at
least 3.17.
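For readers without a TI calculator, the two calculations in Example 7.4 can be reproduced with Python's standard library. This is a minimal sketch, not part of the original text; the variable names are illustrative, and `cdf(2)` gives the whole area to the left of 2 rather than the area from 1 to 2 (the difference is negligible here).

```python
from math import sqrt
from statistics import NormalDist

a, b, n = 1, 5, 75                  # uniform endpoints and sample size from Example 7.4
mu = (a + b) / 2                    # population mean of U(a, b)
sigma = sqrt((b - a) ** 2 / 12)     # population standard deviation of U(a, b)

# CLT: the sample mean is approximately N(mu, sigma / sqrt(n))
xbar_dist = NormalDist(mu, sigma / sqrt(n))

p_less_than_2 = xbar_dist.cdf(2)    # plays the role of normalcdf(1, 2, 3, 1.15/sqrt(75))
k90 = xbar_dist.inv_cdf(0.90)       # plays the role of invNorm(.90, 3, 1.15/sqrt(75))

print(f"P(xbar < 2) = {p_less_than_2:.2e}")
print(f"90th percentile = {k90:.2f}")
```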



Example 7.5 

Suppose that a market research analyst for a cell phone company conducts a study of their cus- 
tomers who exceed the time allowance included on their basic cell phone contract; the analyst 
finds that for those people who exceed the time included in their basic contract, the excess time 
used follows an exponential distribution with a mean of 22 minutes. 

Consider a random sample of 80 customers who exceed the time allowance included in their basic 
cell phone contract. 

Let X = the excess time used by one INDIVIDUAL cell phone customer who exceeds his contracted 
time allowance. 

$X \sim Exp\left(\frac{1}{22}\right)$. From Chapter 5, we know that $\mu = 22$ and $\sigma = 22$.

Let $\bar{X}$ = the AVERAGE excess time used by a sample of n = 80 customers who exceed their
contracted time allowance.

$\bar{X} \sim N\left(22, \frac{22}{\sqrt{80}}\right)$ by the CLT for Sample Means or Averages

Problem 1 

Using the CLT to find Probability: 

a. Find the probability that the average excess time used by the 80 customers in the sample
is longer than 20 minutes. This is asking us to find $P(\bar{X} > 20)$. Draw the graph.

b. Suppose that one customer who exceeds the time limit for his cell phone contract is randomly
selected. Find the probability that this individual customer's excess time is
longer than 20 minutes. This is asking us to find P(X > 20).

c. Explain why the probabilities in (a) and (b) are different.

Solution 
Part a.

Find: $P(\bar{X} > 20)$

$P(\bar{X} > 20) = 0.7919$ using normalcdf(20, 1E99, 22, 22/√80)

The probability is 0.7919 that the average excess time used is more than 20 minutes, for a sample 
of 80 customers who exceed their contracted time allowance. 




[Graph: normal curve centered at 22 with the area to the right of 20 shaded.]

NOTE: 1E99 = 10^99 and -1E99 = -10^99. Press the EE key for E. Or just use 10^99 instead of 1E99.

Part b. 

Find P(X > 20). Remember to use the exponential distribution for an individual: X ~ Exp(1/22).

P(X > 20) = e^(-(1/22)*20) or e^(-0.04545*20) = 0.4029

Part c. Explain why the probabilities in (a) and (b) are different.

P(X > 20) = 0.4029 but $P(\bar{X} > 20) = 0.7919$

The probabilities are not equal because we use different distributions to calculate the prob- 
ability for individuals and for averages. 

When asked to find the probability of an individual value, use the stated distribution of its 
random variable; do not use the CLT. Use the CLT with the normal distribution when 
you are being asked to find the probability for an average. 
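The contrast between parts (a) and (b) can be sketched in a few lines of Python. This is only an illustration of the two calculations above; the variable names are assumptions, and the values 22, 80 and 20 are taken from Example 7.5.

```python
from math import exp, sqrt
from statistics import NormalDist

mean_excess = 22    # mean (and standard deviation) of the exponential distribution
n = 80              # sample size
cutoff = 20         # minutes

# Individual customer: use the stated exponential distribution, P(X > x) = e^(-x / mean)
p_individual = exp(-(1 / mean_excess) * cutoff)

# Average of 80 customers: use the CLT, xbar ~ N(22, 22 / sqrt(80))
xbar_dist = NormalDist(mean_excess, mean_excess / sqrt(n))
p_average = 1 - xbar_dist.cdf(cutoff)

print(f"P(X > 20)    = {p_individual:.4f}")
print(f"P(xbar > 20) = {p_average:.4f}")
```

The same population parameters lead to different probabilities because the two questions refer to different random variables with different distributions.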



Problem 2 

Using the CLT to find Percentiles: 

Find the 95th percentile for the sample average excess time for samples of 80 customers who 
exceed their basic contract time allowances. Draw a graph. 

Solution 

Let k = the 95th percentile. Find k where $P(\bar{X} < k) = 0.95$

k = 26.0 using invNorm(.95, 22, 22/√80) = 26.0

[Graph: normal curve for $\bar{X}$ with area 0.95 shaded to the left of k.]




The 95th percentile for the sample average excess time used is about 26.0 minutes for random 
samples of 80 customers who exceed their contractual allowed time. 

95% of such samples would have averages under 26 minutes; only 5% of such samples would have 
averages above 26 minutes. 
7.5 Summary of Formulas 6

Formula 7.1: Central Limit Theorem for Sample Means
$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)$ The Mean ($\bar{X}$): $\mu_X$

Formula 7.2: Central Limit Theorem for Sample Means Z-Score and Standard Error of the Mean
$z = \frac{\bar{x} - \mu_X}{\left(\frac{\sigma_X}{\sqrt{n}}\right)}$ Standard Error of the Mean (Standard Deviation ($\bar{X}$)): $\frac{\sigma_X}{\sqrt{n}}$

Formula 7.3: Central Limit Theorem for Sums
$\Sigma X \sim N\left(n \cdot \mu_X, \sqrt{n} \cdot \sigma_X\right)$ Mean for Sums ($\Sigma X$): $n \cdot \mu_X$

Formula 7.4: Central Limit Theorem for Sums Z-Score and Standard Deviation for Sums
$z = \frac{\Sigma x - n \cdot \mu_X}{\sqrt{n} \cdot \sigma_X}$ Standard Deviation for Sums ($\Sigma X$): $\sqrt{n} \cdot \sigma_X$
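The four formulas can be expressed as small helper functions. The sketch below is one possible arrangement, assuming Python's `statistics.NormalDist`; the function names `clt_mean_dist`, `clt_sum_dist` and `z_score` are hypothetical, and the numbers (mean 22, standard deviation 22, n = 80) are borrowed from Example 7.5.

```python
from math import sqrt
from statistics import NormalDist

def clt_mean_dist(mu, sigma, n):
    """Formula 7.1: approximate distribution of the sample mean."""
    return NormalDist(mu, sigma / sqrt(n))

def clt_sum_dist(mu, sigma, n):
    """Formula 7.3: approximate distribution of the sample sum."""
    return NormalDist(n * mu, sqrt(n) * sigma)

def z_score(value, dist):
    """Formulas 7.2 and 7.4: standardize a value against a normal distribution."""
    return (value - dist.mean) / dist.stdev

mean_dist = clt_mean_dist(22, 22, 80)
sum_dist = clt_sum_dist(22, 22, 80)

# An average of 20 minutes over 80 customers is the same event as a total of 1600 minutes,
# so the two z-scores agree.
print(round(z_score(20, mean_dist), 4))
print(round(z_score(1600, sum_dist), 4))
```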



6 This content is available online at <http://cnx.org/content/ml6956/1.8/>. 
7.6 Practice: Central Limit Theorem (modified R. Bloom) 7
7.6.1 Student Learning Outcomes 

• The student will explore the properties of data through the Central Limit Theorem. 



7.6.2 Given 

Yoonie is a personnel manager in a large corporation. Each month she must review 16 of the employees.
From past experience, she has found that the reviews take her approximately 4 hours each to do with a
population standard deviation of 1.2 hours. Let X be the random variable representing the time it takes
her to complete one review. Assume X is normally distributed. Let $\bar{X}$ be the random variable representing
the average time to complete the 16 reviews. Let $\Sigma X$ be the total time it takes Yoonie to complete all of the
month's reviews.

7.6.3 Distribution 

Complete the distributions. 

1. X ~ ____

2. $\bar{X}$ ~ ____



7.6.4 Graphing Probability 

For each problem below: 

a. Sketch the graph. Label and scale the horizontal axis. Shade the region corresponding to the probability. 

b. Calculate the value. 



Exercise 7.6.1 (Solution on p. 198.)

Find the probability that one review will take Yoonie from 3.5 to 4.25 hours.

a.

b. P(____ < X < ____) = ____



Exercise 7.6.2 (Solution on p. 198.) 

Find the probability that the average of a month's reviews will take Yoonie from 3.5 to 4.25 hrs. 



7 This content is available online at <http://cnx.org/content/ml8671/1.2/>. 
a.

b. P(____) = ____



Exercise 7.6.3 (Solution on p. 198.)

Find the 95th percentile for the average time to complete one month's reviews.

a.

b. The 95th Percentile = ____



7.6.5 Discussion Question 

Exercise 7.6.4 

What causes the probabilities in Exercise 7.6.1 and Exercise 7.6.2 to differ? 
7.7 Homework Link 8

Link to homework questions in Homework Collection/Book for Central Limit Theorem Chapter 7 of Collaborative
Statistics for R. Bloom

Chapter 7 Homework Problems ( http://cnx.org/content/ml8940/latest/ ) 9 



8 This content is available online at <http://cnx.Org/content/ml9031/l.l/>. 
9 http://cnx.org/content/ml8940/latest/ 
7.8 Review Questions Link 10

Link to Review Questions in Homework Collection/Book for Central Limit Theorem Chapter 7 of Collaborative
Statistics for R. Bloom

Chapter 7 Review Problems ( http://cnx.org/content/ml8863/latest/ ) 11



10 This content is available online at <http://cnx.Org/content/ml9036/l.l/>.
11 http://cnx.org/content/ml8863/latest/
Solutions to Exercises in Chapter 7

Solutions to Practice: Central Limit Theorem (modified R. Bloom)

Solution to Exercise 7.6.1 (p. 194)

b. 3.5, 4.25, 0.2441

Solution to Exercise 7.6.2 (p. 194)

b. 0.7499

NOTE: When calculating the standard deviation for the mean using the Central Limit Theorem in this
problem, the sample size is n = 16, the number of reviews she completes in a month.

Solution to Exercise 7.6.3 (p. 195) 

b. 4.49 hours 



Chapter 8 

Confidence Intervals 

8.1 Confidence Intervals 1 

8.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Calculate and interpret confidence intervals for one population mean and one population proportion. 

• Interpret the student-t probability distribution as the sample size changes. 

• Discriminate between problems applying the normal and the student-t distributions. 

8.1.2 Introduction 

Suppose you are trying to determine the mean rent of a two-bedroom apartment in your town. You might 
look in the classified section of the newspaper, write down several rents listed, and average them together. 
You would have obtained a point estimate of the true mean. If you are trying to determine the percent of 
times you make a basket when shooting a basketball, you might count the number of shots you make and 
divide that by the number of shots you attempted. In this case, you would have obtained a point estimate 
for the true proportion. 

We use sample data to make generalizations about an unknown population. This part of statistics is called 
inferential statistics. The sample data help us to make an estimate of a population parameter. We realize 
that the point estimate is most likely not the exact value of the population parameter, but close to it. After 
calculating point estimates, we construct confidence intervals in which we believe the parameter lies. 

In this chapter, you will learn to construct and interpret confidence intervals. You will also learn a new
distribution, the Student's-t, and how it is used with these intervals. Throughout the chapter, it is important
to keep in mind that the confidence interval is a random variable. It is the parameter that is fixed.

If you worked in the marketing department of an entertainment company, you might be interested in the
mean number of compact discs (CDs) a consumer buys per month. If so, you could conduct a survey
and calculate the sample mean, $\bar{x}$, and the sample standard deviation, s. You would use $\bar{x}$ to estimate
the population mean and s to estimate the population standard deviation. The sample mean, $\bar{x}$, is the
point estimate for the population mean, $\mu$. The sample standard deviation, s, is the point estimate for the
population standard deviation, $\sigma$.



1 This content is available online at <http://cnx.Org/content/ml6967/l.16/>.
Each of $\bar{x}$ and s is also called a statistic.

A confidence interval is another type of estimate but, instead of being just one number, it is an interval 
of numbers. The interval of numbers is a range of values calculated from a given set of sample data. The 
confidence interval is likely to include an unknown population parameter. 

Suppose for the CD example we do not know the population mean $\mu$ but we do know that the population
standard deviation is $\sigma = 1$ and our sample size is 100. Then by the Central Limit Theorem, the standard
deviation for the sample mean is

$\frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{100}} = 0.1$

The Empirical Rule, which applies to bell-shaped distributions, says that in approximately 95% of the
samples, the sample mean, $\bar{x}$, will be within two standard deviations of the population mean $\mu$. For our CD
example, two standard deviations is (2)(0.1) = 0.2. The sample mean $\bar{x}$ is likely to be within 0.2 units of
$\mu$.

Because $\bar{x}$ is within 0.2 units of $\mu$, which is unknown, then $\mu$ is likely to be within 0.2 units of $\bar{x}$ in 95%
of the samples. The population mean $\mu$ is contained in an interval whose lower number is calculated by
taking the sample mean and subtracting two standard deviations ((2)(0.1)) and whose upper number is
calculated by taking the sample mean and adding two standard deviations. In other words, $\mu$ is between
$\bar{x} - 0.2$ and $\bar{x} + 0.2$ in 95% of all the samples.

For the CD example, suppose that a sample produced a sample mean $\bar{x} = 2$. Then the unknown population
mean $\mu$ is between

$\bar{x} - 0.2 = 2 - 0.2 = 1.8$ and $\bar{x} + 0.2 = 2 + 0.2 = 2.2$

We say that we are 95% confident that the unknown population mean number of CDs is between 1.8 and 
2.2. The 95% confidence interval is (1.8, 2.2). 
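The arithmetic behind the interval (1.8, 2.2) can be written out as a short Python sketch. It uses the Empirical Rule multiplier of 2 from the discussion above rather than an exact z-score; the variable names are illustrative assumptions.

```python
from math import sqrt

sigma = 1      # known population standard deviation (CD example)
n = 100        # sample size
xbar = 2       # observed sample mean

standard_error = sigma / sqrt(n)    # standard deviation of the sample mean: 0.1
margin = 2 * standard_error         # Empirical Rule: about 95% within two standard deviations

interval = (xbar - margin, xbar + margin)
print(interval)
```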

The 95% confidence interval implies two possibilities. Either the interval (1.8, 2.2) contains the true mean $\mu$
or our sample produced an $\bar{x}$ that is not within 0.2 units of the true mean $\mu$. The second possibility happens
for only 5% of all the samples (100% - 95%).

Remember that a confidence interval is created for an unknown population parameter like the population
mean, $\mu$. Confidence intervals for some parameters have the form

(point estimate - margin of error, point estimate + margin of error) 

The margin of error depends on the confidence level or percentage of confidence. 

When you read newspapers and journals, some reports will use the phrase "margin of error." Other reports 
will not use that phrase, but include a confidence interval as the point estimate + or - the margin of error. 
These are two ways of expressing the same concept. 

NOTE: Although the text only covers symmetric confidence intervals, there are non-symmetric 
confidence intervals (for example, a confidence interval for the standard deviation). 



8.1.3 Optional Collaborative Classroom Activity 

Have your instructor record the number of meals each student in your class eats out in a week. Assume 
that the standard deviation is known to be 3 meals. Construct an approximate 95% confidence interval for 
the true mean number of meals students eat out each week. 



1. Calculate the sample mean. 

2. $\sigma = 3$ and n = the number of students surveyed.
3. Construct the interval $\left(\bar{x} - 2 \cdot \frac{\sigma}{\sqrt{n}}, \bar{x} + 2 \cdot \frac{\sigma}{\sqrt{n}}\right)$

We say we are approximately 95% confident that the true average number of meals that students eat out in
a week is between ____ and ____.

8.2 Confidence Intervals for a Population Mean, Population Standard 
Deviation Known, Normal (modified R. Bloom) 2 

8.2.1 Calculating the Confidence Interval 

To construct a confidence interval for a single unknown population mean $\mu$, where the population standard
deviation is known, we need $\bar{x}$ as an estimate for $\mu$ and we need the margin of error. Here, the
margin of error is called the error bound for a population mean (abbreviated EBM). The sample mean $\bar{x}$ is
the point estimate of the unknown population mean $\mu$.

The confidence interval estimate will have the form:

(point estimate - error bound, point estimate + error bound) or, in symbols, $(\bar{x} - EBM, \bar{x} + EBM)$

The margin of error depends on the confidence level (abbreviated CL). The confidence level is the probability
that the confidence interval estimate that we will calculate will contain the true population parameter.
Most often, the person constructing the confidence interval chooses a confidence level of
90% or higher, in order to be reasonably certain of the conclusions.

There is another probability called alpha ($\alpha$). $\alpha$ is related to the confidence level CL: $\alpha$ is the probability that
the sample produced a point estimate that is not within the appropriate margin of error of the unknown
population parameter.

Example 8.1 

Suppose we have collected data from a sample. We know the sample average but we do not
know the average for the entire population.
The sample mean is 7 and the error bound for the mean is 2.5.

$\bar{x} = 7$ and EBM = 2.5.

The confidence interval is (7 - 2.5, 7 + 2.5); calculating the values gives (4.5, 9.5).

If the confidence level (CL) is 95%, then we say that "We estimate with 95% confidence that the 
true value of the population mean is between 4.5 and 9.5." 

A confidence interval for a population mean with a known standard deviation is based on the fact that the
sample means follow an approximately normal distribution. Suppose that our sample has a mean of $\bar{x} = 10$
and we have constructed the 90% confidence interval (5, 15) where EBM = 5.

To get a 90% confidence interval, we must include the central 90% of the probability of the normal distribu- 
tion. If we include the central 90%, we leave out a total of 10% in both tails, or 5% in each tail, of the normal 
distribution. 



2 This content is available online at <http://cnx.Org/content/ml8937/l.2/>. 
[Graph: normal curve with central area Confidence Level (CL) = 0.90; $\bar{x} = 10$, EBM = 5, $\bar{x} - EBM = 5$, $\bar{x} + EBM = 15$.]

$\mu$ is believed to be in the interval (5, 15) with 90% confidence.

To capture the central 90%, we must go out 1.645 "standard deviations" on either side of the calculated
sample mean. 1.645 is the z-score from a Standard Normal probability distribution that puts an area of 0.90
in the center, an area of 0.05 in the far left tail, and an area of 0.05 in the far right tail.

It is important that the "standard deviation" used must be appropriate for the parameter we are estimating.
So in this section, we need to use the standard deviation that applies to sample means, which is $\frac{\sigma}{\sqrt{n}}$. $\frac{\sigma}{\sqrt{n}}$ is
commonly called the "standard error of the mean" in order to clearly distinguish the standard deviation for
a mean from the population standard deviation $\sigma$.

In summary, as a result of the Central Limit Theorem: 

• $\bar{X}$ is normally distributed, that is, $\bar{X} \sim N\left(\mu_X, \frac{\sigma}{\sqrt{n}}\right)$

• When the population standard deviation $\sigma$ is known, we use a Normal distribution to calculate
the error bound. 

Calculating the Confidence Interval: 

To construct a confidence interval estimate for an unknown population mean, we need data from a random 
sample. The steps to construct and interpret the confidence interval are: 

• Calculate the sample mean $\bar{x}$ from the sample data. Remember, in this section, we already know the
population standard deviation $\sigma$.

• Find the Z-score that corresponds to the confidence level.

• Calculate the error bound EBM.

• Construct the confidence interval.

• Write a sentence that interprets the estimate in the context of the situation in the problem. (Explain
what the confidence interval means, in the words of the problem.)



We will first examine each step in more detail, and then illustrate the process with some examples. 

Finding Z for the stated Confidence Level 

When we know the population standard deviation $\sigma$, we use a standard normal distribution to calculate
the error bound EBM and construct the confidence interval. We need to find the value of Z that puts an area
equal to the confidence level (in decimal form) in the middle of the standard normal distribution Z ~ N(0, 1).

The confidence level, CL, is the area in the middle of the standard normal distribution. $CL = 1 - \alpha$. So $\alpha$ is
the area that is split equally between the two tails. Each of the tails contains an area equal to $\frac{\alpha}{2}$.

The z-score that has an area to the right of $\frac{\alpha}{2}$ is denoted by $z_{\alpha/2}$.
For example, when CL = 0.95 then $\alpha = 0.05$ and $\frac{\alpha}{2} = 0.025$; we write $z_{\alpha/2} = z_{0.025}$.

The area to the right of $z_{0.025}$ is 0.025 and the area to the left of $z_{0.025}$ is 1 - 0.025 = 0.975.

$z_{\alpha/2} = z_{0.025} = 1.96$, using a calculator, computer or a Standard Normal probability table.

Using the TI83, TI83+ or TI84+ calculator: invNorm(.975, 0, 1) = 1.96

CALCULATOR NOTE: Remember to use area to the LEFT of $z_{\alpha/2}$; in this chapter the last two inputs in the
invNorm command are 0,1 because you are using a Standard Normal Distribution Z ~ N(0,1).
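On a computer, the same lookup can be done with the inverse cumulative distribution function of the standard normal. The sketch below assumes Python's `statistics.NormalDist`; the function name `z_for_confidence_level` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence_level(cl):
    """Return z_(alpha/2): the z-score with area alpha/2 to its right, where alpha = 1 - CL."""
    alpha = 1 - cl
    area_to_left = 1 - alpha / 2          # inverse lookups use the area to the LEFT
    return NormalDist(0, 1).inv_cdf(area_to_left)

print(round(z_for_confidence_level(0.95), 2))    # corresponds to invNorm(.975, 0, 1)
print(round(z_for_confidence_level(0.90), 3))    # corresponds to invNorm(.95, 0, 1)
```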

EBM: Error Bound 

The error bound formula for an unknown population mean $\mu$ when the population standard deviation $\sigma$ is
known is

• $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$

Constructing the Confidence Interval 

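• The confidence interval estimate has the format $(\bar{x} - EBM, \bar{x} + EBM)$.

The graph gives a picture of the entire situation.

$CL + \frac{\alpha}{2} + \frac{\alpha}{2} = CL + \alpha = 1$.

[Graph: normal curve with central area $CL = 1 - \alpha$ between $\bar{x} - EBM$ and $\bar{x} + EBM$, and area $\frac{\alpha}{2}$ in each tail.]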



Writing the Interpretation 

The interpretation should clearly state the confidence level (CL), explain what population parameter is
being estimated (here, a population mean or average), and should state the confidence interval (both
endpoints). "We estimate with ____% confidence that the true population average (include context of the
problem) is between ____ and ____ (include appropriate units)."

Example 8.2 

Suppose scores on exams in statistics are normally distributed with an unknown population mean 
and a population standard deviation of 3 points. A random sample of 36 scores is taken and gives 
a sample mean (sample average score) of 68. Find a confidence interval estimate for the population 
mean exam score (the average score on all exams). 

Problem 

Find a 90% confidence interval for the true (population) mean of statistics exam scores. 

• The solution is shown step-by-step. 
Solution

To find the confidence interval, you need the sample mean, $\bar{x}$, and the EBM.

$\bar{x} = 68$

$EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right)$

$\sigma = 3$; n = 36; The confidence level is 90% (CL = 0.90)
CL = 0.90 so $\alpha$ = 1 - CL = 1 - 0.90 = 0.10

$\frac{\alpha}{2} = 0.05$, $z_{\alpha/2} = z_{0.05}$

The area to the right of $z_{0.05}$ is 0.05 and the area to the left of $z_{0.05}$ is 1 - 0.05 = 0.95

$z_{\alpha/2} = z_{0.05} = 1.645$

using invNorm(.95, 0, 1) on the TI-83, 83+, 84+ calculators. This can also be found using appropriate
commands on other calculators, using a computer, or using a probability table for the Standard
Normal distribution.

$EBM = 1.645 \cdot \left(\frac{3}{\sqrt{36}}\right) = 0.8225$

$\bar{x} - EBM = 68 - 0.8225 = 67.1775$

$\bar{x} + EBM = 68 + 0.8225 = 68.8225$

The 90% confidence interval is (67.1775, 68.8225). 

Interpretation 

We estimate with 90% confidence that the true population mean exam score for all statistics stu- 
dents is between 67.18 and 68.82. 

Explanation of 90% Confidence Level 

90% of all confidence intervals constructed in this way contain the true average statistics exam 
score. For example, if we constructed 100 of these confidence intervals, we would expect 90 of 
them to contain the true population mean exam score. 
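The steps of Example 8.2 can be collected into one function. This is a sketch under the section's assumption that the population standard deviation is known; the function name `z_confidence_interval` is hypothetical. Because it uses the unrounded z-score rather than 1.645, the endpoints agree with the worked solution only after rounding to two decimal places.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, cl):
    """Confidence interval for a population mean when sigma is known."""
    alpha = 1 - cl
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_(alpha/2), using area to the left
    ebm = z * sigma / sqrt(n)                  # error bound for the mean
    return xbar - ebm, xbar + ebm

# Example 8.2: sample mean 68, sigma = 3, n = 36, 90% confidence level
lower, upper = z_confidence_interval(68, 3, 36, 0.90)
print(f"({lower:.2f}, {upper:.2f})")
```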



8.2.2 Changing the Confidence Level or Sample Size 

Example 8.3: Changing the Confidence Level 

Suppose we change the original problem by using a 95% confidence level. Find a 95% confidence 
interval for the true (population) mean statistics exam score. 

Solution 

To find the confidence interval, you need the sample mean, $\bar{x}$, and the EBM.

$\bar{x} = 68$

$EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right)$

$\sigma = 3$; n = 36; The confidence level is 95% (CL = 0.95)

CL = 0.95 so $\alpha$ = 1 - CL = 1 - 0.95 = 0.05

$\frac{\alpha}{2} = 0.025$, $z_{\alpha/2} = z_{0.025}$

The area to the right of $z_{0.025}$ is 0.025 and the area to the left of $z_{0.025}$ is 1 - 0.025 = 0.975

$z_{\alpha/2} = z_{0.025} = 1.96$

using invNorm(.975, 0, 1) on the TI-83, 83+, 84+ calculators. (This can also be found using appropriate
commands on other calculators, using a computer, or using a probability table for the Standard
Normal distribution.)

$EBM = 1.96 \cdot \left(\frac{3}{\sqrt{36}}\right) = 0.98$

$\bar{x} - EBM = 68 - 0.98 = 67.02$
$\bar{x} + EBM = 68 + 0.98 = 68.98$



Interpretation 

We estimate with 95% confidence that the true population average for all statistics exam scores is
between 67.02 and 68.98.

Explanation of 95% Confidence Level 

95% of all confidence intervals constructed in this way contain the true value of the population 
average statistics exam score. 

Comparing the results 

The 90% confidence interval is (67.18, 68.82). The 95% confidence interval is (67.02, 68.98). The 
95% confidence interval is wider. If you look at the graphs, because the area 0.95 is larger than the 
area 0.90, it makes sense that the 95% confidence interval is wider. 



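[Figure 8.1: (a) 90% confidence level: central area 0.90 with area 0.05 in each tail. (b) 95% confidence level: central area 0.95 with area 0.025 in each tail.]

Summary: Effect of Changing the Confidence Level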



• Increasing the confidence level increases the error bound, making the confidence interval 
wider. 

• Decreasing the confidence level decreases the error bound, making the confidence interval 
narrower. 
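The effect of the confidence level on the error bound can be seen by holding $\sigma = 3$ and n = 36 fixed, as in Examples 8.2 and 8.3, and varying only CL. The helper name `error_bound` below is a hypothetical choice for this sketch.

```python
from math import sqrt
from statistics import NormalDist

def error_bound(sigma, n, cl):
    """EBM = z_(alpha/2) * sigma / sqrt(n) for a known population standard deviation."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
    return z * sigma / sqrt(n)

for cl in (0.90, 0.95):
    ebm = error_bound(3, 36, cl)
    # A larger confidence level gives a larger EBM and therefore a wider interval.
    print(f"CL = {cl:.2f}: EBM = {ebm:.4f}, interval width = {2 * ebm:.4f}")
```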



Example 8.4: Changing the Sample Size

Suppose we change the original problem to see what happens to the error bound if the sample size
is changed.
Problem

Leave everything the same except the sample size. Use the original 90% confidence level. What
happens to the error bound and the confidence interval if we increase the sample size and use
n=100 instead of n=36? What happens if we decrease the sample size to n=25 instead of n=36?

• $\bar{x} = 68$

• $EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right)$

• $\sigma = 3$; The confidence level is 90% (CL = 0.90); $z_{\alpha/2} = z_{0.05} = 1.645$

Solution A 

If we increase the sample size n to 100, we decrease the error bound. 

When n = 100: $EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right) = 1.645 \cdot \left(\frac{3}{\sqrt{100}}\right) = 0.4935$

Solution B 

If we decrease the sample size n to 25, we increase the error bound. 

When n = 25: $EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right) = 1.645 \cdot \left(\frac{3}{\sqrt{25}}\right) = 0.987$

Summary: Effect of Changing the Sample Size

• Increasing the sample size causes the error bound to decrease, making the confidence interval
narrower.

• Decreasing the sample size causes the error bound to increase, making the confidence interval
wider.
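A short loop makes the dependence on sample size concrete. The sketch below keeps $\sigma = 3$ and the 90% confidence level from Example 8.4 and varies only n; the helper name `error_bound_for_n` is an assumption of this example, and it uses the unrounded z-score, so results differ slightly from the hand calculations in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def error_bound_for_n(n, sigma=3, cl=0.90):
    """EBM for a known sigma at the given confidence level and sample size."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
    return z * sigma / sqrt(n)

for n in (25, 36, 100):
    # The error bound shrinks in proportion to 1 / sqrt(n).
    print(f"n = {n:3d}: EBM = {error_bound_for_n(n):.4f}")
```

Note that quadrupling the sample size (from 25 to 100) only halves the error bound.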



8.2.3 Working Backwards to Find the Error Bound or Sample Mean 


When we calculate a confidence interval, we find the sample mean and calculate the error bound and use 
them to calculate the confidence interval. But sometimes when we read statistical studies, the study may 
state the confidence interval only. If we know the confidence interval, we can work backwards to find both 
the error bound and the sample mean. 

Finding the Error Bound 

• From the upper value for the interval, subtract the sample mean 

• OR, From the upper value for the interval, subtract the lower value. Then divide the difference by 2. 

Finding the Sample Mean 

• Subtract the error bound from the upper value of the confidence interval 

• OR, Average the upper and lower endpoints of the confidence interval 

Notice that there are two methods to perform each calculation. You can choose the method that is easier to 
use with the information you know. 

Example 8.5 

Suppose we know that a confidence interval is (67.18, 68.82) and we want to find the error bound.
We may know that the sample mean is 68. Or perhaps our source only gave the confidence interval
and did not tell us the value of the sample mean.

Calculate the Error Bound:
• If we know that the sample mean is 68: EBM = 68.82 - 68 = 0.82

• If we don't know the sample mean: $EBM = \frac{68.82 - 67.18}{2} = 0.82$

Calculate the Sample Mean:

• If we know the error bound: $\bar{x}$ = 68.82 - 0.82 = 68

• If we don't know the error bound: $\bar{x} = \frac{67.18 + 68.82}{2} = 68$
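When only the interval is reported, both quantities follow from its endpoints. A minimal sketch, with the hypothetical function name `mean_and_error_bound`:

```python
def mean_and_error_bound(lower, upper):
    """Recover the sample mean and error bound from a symmetric confidence interval."""
    sample_mean = (lower + upper) / 2    # midpoint of the interval
    ebm = (upper - lower) / 2            # half the width of the interval
    return sample_mean, ebm

sample_mean, ebm = mean_and_error_bound(67.18, 68.82)
print(round(sample_mean, 2), round(ebm, 2))
```

This works only for symmetric intervals of the form (point estimate - error bound, point estimate + error bound).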



8.2.4 Calculating the Sample Size n 

If researchers desire a specific margin of error, then they can use the error bound formula to calculate the 
required sample size. 

The error bound formula for a population mean when the population standard deviation is known is
$EBM = z_{\alpha/2} \cdot \left(\frac{\sigma}{\sqrt{n}}\right)$

The formula for sample size is $n = \frac{z^2 \sigma^2}{EBM^2}$, found by solving the error bound formula for n.

In this formula, z is $z_{\alpha/2}$, corresponding to the desired confidence level. A researcher planning a study who
wants a specified confidence level and error bound can use this formula to calculate the size of the sample
needed for the study.

Example 8.6 

The population standard deviation for the age of Foothill College students is 15 years. If we
want to be 95% confident that the sample mean age is within 2 years of the true population mean
age of Foothill College students, how many randomly selected Foothill College students must be
surveyed?

From the problem, we know that $\sigma = 15$ and EBM = 2
$z = z_{0.025} = 1.96$, because the confidence level is 95%.

$n = \frac{z^2 \sigma^2}{EBM^2} = \frac{1.96^2 \cdot 15^2}{2^2} = 216.09$ using the sample size equation.

Use n = 217: Always round the answer UP to the next higher integer to ensure that the sample
size is large enough.

Therefore, 217 Foothill College students should be surveyed in order to be 95% confident that we
are within 2 years of the true population mean age of Foothill College students.
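The sample size calculation, including the round-up rule, can be sketched as follows. The function name `required_sample_size` is hypothetical, and the code uses the unrounded z-score, so the intermediate value differs slightly from 216.09 while the rounded-up answer is the same.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, ebm, cl):
    """Smallest n with z_(alpha/2) * sigma / sqrt(n) <= ebm, for a known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - cl) / 2)
    n_exact = (z ** 2) * (sigma ** 2) / (ebm ** 2)
    return ceil(n_exact)    # always round UP so the sample is large enough

# Example 8.6: sigma = 15 years, desired error bound 2 years, 95% confidence
print(required_sample_size(15, 2, 0.95))
```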



8.3 Confidence Interval for a Population Mean, Standard Deviation Unknown,
Student-T (modified R. Bloom) 3

In practice, we rarely know the population standard deviation. In the past, when the sample size was large,
this did not present a problem to statisticians. They used the sample standard deviation s as an estimate
for $\sigma$ and proceeded as before to calculate a confidence interval with close enough results. However,
statisticians ran into problems when the sample size was small. A small sample size caused inaccuracies in
the confidence interval.



3 This content is available online at <http://cnx.Org/content/ml8935/l.2/>.
William S. Gosset (1876-1937) of the Guinness brewery in Dublin, Ireland ran into this problem. His experiments
with hops and barley produced very few samples. Just replacing $\sigma$ with s did not produce accurate
results when he tried to calculate a confidence interval. He realized that he could not use a normal distribution
for the calculation; he found that the actual distribution depends on the sample size. This problem
led him to "discover" what is called the Student-t distribution. The name comes from the fact that Gosset
wrote under the pen name "Student."

Up until the mid 1990s, statisticians used the normal distribution approximation for large sample sizes
and only used the Student-t distribution for sample sizes of at most 30. With the common use of graphing
calculators and computers, the practice is to use the Student-t distribution whenever s is used as an estimate
for $\sigma$.

If you draw a simple random sample of size n from a population that has approximately a normal distribution
with mean $\mu$ and unknown population standard deviation $\sigma$ and calculate the t-score $t = \frac{\bar{x} - \mu}{\left(\frac{s}{\sqrt{n}}\right)}$,
then the t-scores follow a Student-t distribution with n - 1 degrees of freedom. The t-score has the same
interpretation as the z-score. It measures how far $\bar{x}$ is from its mean $\mu$. For each sample size n, there is a
different Student-t distribution.

The degrees of freedom, n - 1, come from the calculation of the sample standard deviation s. In Chapter
2, we used n deviations ($x - \bar{x}$ values) to calculate s. Because the sum of the deviations is 0, we can find
the last deviation once we know the other n - 1 deviations. The other n - 1 deviations can change or vary
freely. We call the number n - 1 the degrees of freedom (df).

Properties of the Student-t Distribution 

• The graph for the Student-t distribution is similar to the Standard Normal curve.

• The mean for the Student-t distribution is 0 and the distribution is symmetric about 0.

• The Student-t distribution has more probability in its tails than the Standard Normal distribution
because the spread of the t distribution is greater than the spread of the Standard Normal. So the
graph of the Student-t distribution will be thicker in the tails and shorter in the center than the graph
of the Standard Normal distribution.

• The exact shape of the Student-t distribution depends on the "degrees of freedom". As the degrees
of freedom increase, the graph of the Student-t distribution becomes more like the graph of the Standard
Normal distribution.

• The underlying population of individual observations is assumed to be normally distributed with
unknown population mean $\mu$ and unknown population standard deviation $\sigma$. In the real world,
however, as long as the underlying population is large and bell-shaped, and the data are a simple
random sample, practitioners often consider the assumptions met.
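The shape properties listed above can be checked numerically. The sketch below evaluates the standard formula for the Student-t density, which is not given in this text and is brought in here as an outside fact; the function name `student_t_pdf` and the degrees of freedom chosen (2, 14, 1000) are illustrative assumptions.

```python
from math import exp, lgamma, pi, sqrt
from statistics import NormalDist

def student_t_pdf(t, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df
    log_coef = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * (log_df_pi := _log(df * pi))
    return exp(log_coef) * (1 + t * t / df) ** (-(df + 1) / 2)

def _log(x):
    """Natural logarithm, kept separate for readability."""
    from math import log
    return log(x)

standard_normal = NormalDist(0, 1)

for df in (2, 14, 1000):
    # Compare the height at the center (t = 0) and in the tail (t = 3) with the normal curve.
    print(df,
          round(student_t_pdf(0, df), 4), round(standard_normal.pdf(0), 4),
          round(student_t_pdf(3, df), 4), round(standard_normal.pdf(3), 4))
```

For small df the t curve is lower at the center and higher in the tail than the Standard Normal curve; by df = 1000 the two are nearly indistinguishable.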

Calculators and computers can easily calculate any Student-t probabilities. The TI-83, 83+, 84+ have a tcdf
function to find the probability for given values of t. The syntax for the tcdf command is tcdf(lower
bound, upper bound, degrees of freedom). However for confidence intervals, we need to use inverse
probability to find the value of t when we know the probability.

For the TI-84+ we will use the invT command on the DISTRibution menu. The invT command works
similarly to the invNorm. The invT command requires two inputs: invT(area to the left, degrees of
freedom). The output is the t-score that corresponds to the area we specified.

The TI-83 and TI-83+ do not have the invT command but you can download an invT program from your 
instructor that is easy to use. (The TI-89 has an inverse T command.) 

A probability table for the Student-t distribution can also be used. The table gives t-scores that correspond
to the confidence level (column) and degrees of freedom (row). (The TI-86 does not have an invT program
or command, so if you are using that calculator, you need to use a probability table for the Student-t distribution.)
When using a t-table, note that some tables are formatted to show the confidence level in the column
headings, while the column headings in some tables may show only the corresponding area in one or both tails.

The notation for the Student-t distribution (using T as the random variable) is 

• $T \sim t_{df}$ where $df = n - 1$. 

• For example, if we have a sample of size n = 20 items, then we calculate the degrees of freedom as 
$df = n - 1 = 20 - 1 = 19$ and we write the distribution as $T \sim t_{19}$. 

If the population standard deviation is not known, the error bound for a population mean is: 

• $EBM = t_{\alpha/2} \cdot \left(\frac{s}{\sqrt{n}}\right)$ 

• $t_{\alpha/2}$ is the t-score with area to the right equal to $\frac{\alpha}{2}$ 

• use $df = n - 1$ degrees of freedom 

• s = sample standard deviation 

Example 8.7 

Suppose you do a study of acupuncture to determine how effective it is in relieving pain. You 
measure sensory rates for 15 subjects with the results given below. Use the sample data to con- 
struct a 95% confidence interval for the mean sensory rate for the population (assumed normal) 
from which you took the data. 
8.6; 9.4; 7.9; 6.8; 8.3; 7.3; 9.2; 9.6; 8.7; 11.4; 10.3; 5.4; 8.1; 5.5; 6.9 

• The solution is shown step by step. 

Solution 

To find the confidence interval, you need the sample mean, $\bar{x}$, and the EBM. 

$\bar{x} = 8.2267$, $s = 1.6722$, $n = 15$ 

$df = 15 - 1 = 14$ 

$CL = 0.95$ so $\alpha = 1 - CL = 1 - 0.95 = 0.05$ 

$\frac{\alpha}{2} = 0.025$, so $t_{\alpha/2} = t_{.025}$ 

The area to the right of $t_{.025}$ is 0.025 and the area to the left of $t_{.025}$ is 1 - 0.025 = 0.975. 

$t_{\alpha/2} = t_{.025} = 2.14$ using invT(.975,14) on the TI-84+ calculator. 

$EBM = t_{\alpha/2} \cdot \left(\frac{s}{\sqrt{n}}\right)$ 

$EBM = 2.14 \cdot \left(\frac{1.6722}{\sqrt{15}}\right) = 0.924$ 

$\bar{x} - EBM = 8.2267 - 0.9240 = 7.30$ 

$\bar{x} + EBM = 8.2267 + 0.9240 = 9.15$ 

The 95% confidence interval is (7.30, 9.15). 

We estimate with 95% confidence that the true population average sensory rate is between 7.30 
and 9.15. 

Note: When calculating the error bound, a probability table for the Student-t distribution can also 
be used to find the value of t. The table gives t-scores that correspond to the confidence level 
(column) and degrees of freedom (row); the t-score is found where the row and column intersect 
in the table. 
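
The arithmetic in Example 8.7 can be checked with a short Python sketch. The function name t_interval is hypothetical, and the t-score 2.14 is taken from the example (invT(.975,14)) rather than computed here.

```python
import math
import statistics

# Sensory rates for the 15 subjects in Example 8.7.
sensory_rates = [8.6, 9.4, 7.9, 6.8, 8.3, 7.3, 9.2, 9.6, 8.7,
                 11.4, 10.3, 5.4, 8.1, 5.5, 6.9]

def t_interval(data, t_score):
    """Return (lower, upper, EBM) for a mean when sigma is unknown."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation (n - 1)
    ebm = t_score * s / math.sqrt(n)    # EBM = t * s / sqrt(n)
    return x_bar - ebm, x_bar + ebm, ebm

# t_score = 2.14 is the value quoted in the example, df = 14.
lower, upper, ebm = t_interval(sensory_rates, t_score=2.14)
print(f"EBM = {ebm:.3f}; 95% CI = ({lower:.2f}, {upper:.2f})")
```

The printed values should agree with the example's EBM and interval after rounding.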



8.4 Confidence Interval for a Population Proportion (modified R. Bloom) 4 

During an election year, we see articles in the newspaper that state confidence intervals in terms of pro- 
portions or percentages. For example, a poll for a particular candidate running for president might show 
that the candidate has 40% of the vote within 3 percentage points. Often, election polls are calculated with 
95% confidence. So, the pollsters would be 95% confident that the true proportion of voters who favored 
the candidate would be between 0.37 and 0.43: (0.40 - 0.03, 0.40 + 0.03). 

Investors in the stock market are interested in the true proportion of stocks that go up and down each week. 
Businesses that sell personal computers are interested in the proportion of households in the United States 
that own personal computers. Confidence intervals can be calculated for the true proportion of stocks that 
go up or down each week and for the true proportion of households in the United States that own personal 
computers. 

The procedure to find the confidence interval, the sample size, the error bound, and the confidence level 
for a proportion is similar to that for the population mean. The formulas are different. 

How do you know you are dealing with a proportion problem? First, the underlying distribution is 
binomial. (There is no mention of a mean or average.) If X is a binomial random variable, then X ~ B (n, p) 
where n = the number of trials and p = the probability of a success. To form a proportion, take X, the 
random variable for the number of successes and divide it by n, the number of trials (or the sample size). 
The random variable P' (read "P prime") is that proportion, 

$P' = \frac{X}{n}$ 

(Sometimes the random variable is written using the symbol $\hat{P}$, read "P hat".) 
When n is large, we can use the normal distribution to approximate the binomial. 

$X \sim N\left(n \cdot p, \sqrt{n \cdot p \cdot q}\right)$ 

If we divide all values of the random variable by n, the mean by n, and the standard deviation by n, we 
get a normal distribution of proportions with P', called the estimated proportion, as the random variable. 
(Recall that a proportion = the number of successes divided by n.) 



Using algebra to simplify: $\frac{\sqrt{n \cdot p \cdot q}}{n} = \sqrt{\frac{p \cdot q}{n}}$ 

P' follows a normal distribution for proportions: $P' \sim N\left(p, \sqrt{\frac{p \cdot q}{n}}\right)$ 



4 This content is available online at <http://cnx.org/content/m18934/1.2/>. 

The confidence interval has the form $(p' - EBP,\ p' + EBP)$. 



$p' = \frac{x}{n}$ 

p' = the estimated proportion of successes (p' is a point estimate for p, the true proportion) 
x = the number of successes and n = the size of the sample 
The formula for the error bound for a proportion is 
$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p' q'}{n}}$ where $q' = 1 - p'$ 

This formula is similar to the error bound formula for a mean, except that the "appropriate standard devia- 
tion" is different. For a mean, when the population standard deviation is known, the appropriate standard 
deviation that we use is $\frac{\sigma}{\sqrt{n}}$. For a proportion, the appropriate standard deviation is $\sqrt{\frac{pq}{n}}$. 

However, in the error bound formula, we use $\sqrt{\frac{p' q'}{n}}$ as the standard deviation, instead of $\sqrt{\frac{pq}{n}}$. 

In the error bound formula, the sample proportions p' and q' are estimates of the unknown population 
proportions p and q. The estimated proportions p' and q' are used because p and q are not known. p' and 
q' are calculated from the data. p' is the estimated proportion of successes. q' is the estimated proportion of 
failures. 

NOTE: For the normal distribution of proportions, the z-score formula is as follows. 
If $P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$ then the z-score formula is $z = \frac{p' - p}{\sqrt{\frac{pq}{n}}}$ 



Example 8.8 

Suppose that a market research firm is hired to estimate the percent of adults living in a large 
city who have cell phones. 500 randomly selected adult residents of this city are surveyed to deter- 
mine whether they have cell phones. Of the 500 people surveyed, 421 responded yes - they own 
cell phones. Using a 95% confidence level, compute a confidence interval estimate for the true 
proportion of adult residents of this city who have cell phones. 

Solution 

Let X = the number of people in the sample who have cell phones. X is binomial. 
$X \sim B\left(500, \frac{421}{500}\right)$. 

To calculate the confidence interval, you must find p', q', and EBP. 
n = 500 x = the number of successes = 421 

$p' = \frac{x}{n} = \frac{421}{500} = 0.842$ 

p' = 0.842 is the sample proportion; this is the point estimate of the population proportion. 

$q' = 1 - p' = 1 - 0.842 = 0.158$ 

Since CL = 0.95, then $\alpha = 1 - CL = 1 - 0.95 = 0.05$ and $\frac{\alpha}{2} = 0.025$. 

$z_{\alpha/2} = z_{.025} = 1.96$ 

Use the TI-83, 83+ or 84+ calculator command invNorm(.975,0,1) to find $z_{.025}$. Remember that the 
area to the right of $z_{.025}$ is 0.025 and the area to the left of $z_{.025}$ is 0.975. This can also be found 
using appropriate commands on other calculators, using a computer, or using a Standard Normal 
probability table. 

$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p' q'}{n}} = 1.96 \cdot \sqrt{\frac{(0.842)(0.158)}{500}} = 0.032$ 

$p' - EBP = 0.842 - 0.032 = 0.810$ 

$p' + EBP = 0.842 + 0.032 = 0.874$ 

The confidence interval for the true binomial population proportion is 
$(p' - EBP,\ p' + EBP) = (0.810, 0.874)$. 

Interpretation 

We estimate with 95% confidence that between 81% and 87.4% of all adult residents of this city 
have cell phones. 

Explanation of 95% Confidence Level 

95% of the confidence intervals constructed in this way would contain the true value for the pop- 
ulation proportion of all adult residents of this city who have cell phones. 
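
The steps of Example 8.8 can also be carried out on a computer. The following Python sketch uses only the standard library; the function name proportion_interval is hypothetical, and NormalDist().inv_cdf plays the role of invNorm.

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence_level):
    """Return (lower, upper, EBP) for a single population proportion."""
    p_prime = x / n                  # point estimate of p
    q_prime = 1 - p_prime
    alpha = 1 - confidence_level
    # z-score with area alpha/2 to its right (like invNorm(1 - alpha/2, 0, 1))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ebp = z * math.sqrt(p_prime * q_prime / n)
    return p_prime - ebp, p_prime + ebp, ebp

# Example 8.8: 421 of 500 surveyed adults own cell phones, 95% confidence.
lower, upper, ebp = proportion_interval(421, 500, 0.95)
print(f"EBP = {ebp:.3f}; CI = ({lower:.3f}, {upper:.3f})")
```

Changing the inputs to 300, 500, and 0.90 applies the same sketch to a 90% interval for 300 successes out of 500.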



Example 8.9 

For a class project, a political science student at a large university wants to determine the percent 
of students that are registered voters. He surveys 500 students and finds that 300 are registered 
voters. Compute a 90% confidence interval for the true percent of students that are registered 
voters and interpret the confidence interval. 

Solution 

x = 300 and n = 500. Using a TI-83+ or 84 calculator, the 90% confidence interval for the true 
percent of students that are registered voters is (0.564, 0.636). 

$p' = \frac{x}{n} = \frac{300}{500} = 0.600$ 

$q' = 1 - p' = 1 - 0.600 = 0.400$ 

Since CL = 0.90, then $\alpha = 1 - CL = 1 - 0.90 = 0.10$ and $\frac{\alpha}{2} = 0.05$. 

$z_{\alpha/2} = z_{.05} = 1.645$ 

Use the TI-83, 83+ or 84+ calculator command invNorm(.95,0,1) to find $z_{.05}$. Remember that the 
area to the right of $z_{.05}$ is 0.05 and the area to the left of $z_{.05}$ is 0.95. This can also be found us- 
ing appropriate commands on other calculators, using a computer, or using a Standard Normal 
probability table. 

$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p' q'}{n}} = 1.645 \cdot \sqrt{\frac{(0.60)(0.40)}{500}} = 0.036$ 

p' - EBP = 0.60 - 0.036 = 0.564 

p' + EBP = 0.60 + 0.036 = 0.636 
Interpretation: 

• We estimate with 90% confidence that the true percent of all students that are registered 
voters is between 56.4% and 63.6%. 
• Alternate Wording: We estimate with 90% confidence that between 56.4% and 63.6% of ALL 
students are registered voters. 

Explanation of 90% Confidence Level 

90% of all confidence intervals constructed in this way contain the true value for the population 
percent of students that are registered voters. 



8.4.1 Calculating the Sample Size 

If researchers desire a specific margin of error, then they can use the error bound formula to calcu- 
late the required sample size. 

The error bound formula for a proportion is $EBP = z_{\alpha/2} \cdot \sqrt{\frac{p' q'}{n}}$. Solving for n gives you an equation 
for the sample size: 

$n = \frac{z^2 p' q'}{EBP^2}$, where $z = z_{\alpha/2}$ 
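
The algebra behind the sample size equation is short. A sketch of the steps, writing z for the z-score with area α/2 to its right:

```latex
\begin{align*}
EBP &= z \cdot \sqrt{\frac{p' q'}{n}} \\
EBP^2 &= z^2 \cdot \frac{p' q'}{n} && \text{(square both sides)} \\
n \cdot EBP^2 &= z^2 \, p' q' && \text{(multiply both sides by } n\text{)} \\
n &= \frac{z^2 \, p' q'}{EBP^2} && \text{(divide both sides by } EBP^2\text{)}
\end{align*}
```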

Example 8.10 

Suppose a mobile phone company wants to determine the current percentage of customers aged 
50+ that use text messaging on their cell phone. How many customers aged 50+ should the com- 
pany survey in order to be 90% confident that the estimated (sample) proportion is within 3 per- 
centage points of the true population proportion of customers aged 50+ that use text messaging 
on their cell phone? 

From the problem, we know that EBP = 0.03 (3% = 0.03) and $z_{\alpha/2} = z_{.05} = 1.645$ because the confi- 
dence level is 90%. 

However, in order to find n, we need to know the estimated (sample) proportion p'. Remember 
that q' = 1 - p'. But, we do not know p' yet. Since we multiply p' and q' together, we make them both 
equal to 0.5 because p'q' = (.5)(.5) = .25 results in the largest possible product. (Try other products: 
(.6)(.4) = .24; (.3)(.7) = .21; (.2)(.8) = .16 and so on). The largest possible product gives us the largest n. 
This gives us a large enough sample so that we can be 90% confident that we are within 3 percent- 
age points of the true population proportion. To calculate the sample size n, use the formula and 
make the substitutions. 

$n = \frac{z^2 p' q'}{EBP^2}$ gives $n = \frac{1.645^2 (.5)(.5)}{.03^2} = 751.7$ 

Round the answer to the next higher value. The sample size should be 752 cell phone customers 
aged 50+ in order to be 90% confident that the estimated (sample) proportion is within 3 percent- 
age points of the true population proportion of all customers aged 50+ that use text messaging on 
their cell phone. 
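
The sample size calculation in Example 8.10 can be sketched in Python. The function name sample_size_for_proportion is hypothetical; the default p' = 0.5 is the conservative choice described in the example, and the z-score is computed rather than rounded to 1.645, which gives the same sample size after rounding up.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(ebp, confidence_level, p_prime=0.5):
    """Smallest whole n that keeps the error bound at or below ebp."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_exact = z ** 2 * p_prime * (1 - p_prime) / ebp ** 2
    return math.ceil(n_exact)  # always round up to the next whole person

# Example 8.10: within 3 percentage points, 90% confidence, p' unknown.
n_needed = sample_size_for_proportion(ebp=0.03, confidence_level=0.90)
print(n_needed)
```

Passing a value of p_prime other than 0.5 makes the product p'q' smaller, so the required sample size can only go down.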

8.5 Summary of Formulas 5 

Formula 8.1: General form of a confidence interval 

(lower value, upper value) = (point estimate - error bound, point estimate + error bound) 

Formula 8.2: To find the error bound when you know the confidence interval 

error bound = upper value - point estimate OR error bound = $\frac{\text{upper value} - \text{lower value}}{2}$ 

Formula 8.3: Single Population Mean, Known Standard Deviation, Normal Distribution 
Use the Normal Distribution for Means (Section 7.2) $EBM = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$ 

The confidence interval has the format $(\bar{x} - EBM,\ \bar{x} + EBM)$. 

Formula 8.4: Single Population Mean, Unknown Standard Deviation, Student's-t Distribution 

Use the Student's-t Distribution with degrees of freedom $df = n - 1$. $EBM = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$ 

Formula 8.5: Single Population Proportion, Normal Distribution 

Use the Normal Distribution for a single population proportion $p' = \frac{x}{n}$ 

$EBP = z_{\alpha/2} \cdot \sqrt{\frac{p' q'}{n}}$ where $p' + q' = 1$ 

The confidence interval has the format $(p' - EBP,\ p' + EBP)$. 

Formula 8.6: Point Estimates 
$\bar{x}$ is a point estimate for $\mu$ 
p' is a point estimate for p 
s is a point estimate for $\sigma$ 



5 This content is available online at <http://cnx.org/content/m16973/1.8/>. 

8.6 Practice 1: Confidence Intervals for Averages, Known Population Standard Deviation 6 

8.6.1 Student Learning Outcomes 

• The student will calculate confidence intervals for means when the population standard deviation is 
known. 

8.6.2 Given 

The mean age for all Foothill College students for a recent Fall term was 33.2. The population standard de- 
viation has been pretty consistent at 15. Suppose that twenty-five Winter students were randomly selected. 
The mean age for the sample was 30.4. We are interested in the true mean age for Winter Foothill College 
students. (http://research.fhda.edu/factbook/FH_Demo_Trends/FoothillDemographicTrends.htm 7) 

Let X = the age of a Winter Foothill College student 

8.6.3 Calculating the Confidence Interval 

Exercise 8.6.1 (Solution on p. 223.) 

$\bar{x}$ = 

Exercise 8.6.2 (Solution on p. 223.) 

n= 

Exercise 8.6.3 (Solution on p. 223.) 

15= (insert symbol here) 

Exercise 8.6.4 (Solution on p. 223.) 

Define the Random Variable, $\bar{X}$, in words. 

$\bar{X}$ = 

Exercise 8.6.5 (Solution on p. 223.) 

What is $\bar{x}$ estimating? 

Exercise 8.6.6 (Solution on p. 223.) 

Is $\sigma_x$ known? 

Exercise 8.6.7 (Solution on p. 223.) 

As a result of your answer to (4), state the exact distribution to use when calculating the Confi- 
dence Interval. 



8.6.4 Explaining the Confidence Interval 

Construct a 95% Confidence Interval for the true mean age of Winter Foothill College students. 

Exercise 8.6.8 (Solution on p. 223.) 

How much area is in both tails (combined)? $\alpha$ = 

Exercise 8.6.9 (Solution on p. 223.) 

How much area is in each tail? $\frac{\alpha}{2}$ = 



Exercise 8.6.10 (Solution on p. 223.) 

Identify the following specifications: 

a. lower limit = 

b. upper limit = 

c. error bound = 

Exercise 8.6.11 (Solution on p. 223.) 

The 95% Confidence Interval is: 

Exercise 8.6.12 

Fill in the blanks on the graph with the areas, upper and lower limits of the Confidence Interval, and 
the sample mean. 

Figure 8.2 (graph omitted: curve with blanks for $\frac{\alpha}{2}$ in each tail, C.L. in the center, and $\bar{X}$ on the horizontal axis) 

6 This content is available online at <http://cnx.org/content/m16970/1.13/>. 
7 http://research.fhda.edu/factbook/FH_Demo_Trends/FoothillDemographicTrends.htm 



Exercise 8.6.13 

In one complete sentence, explain what the interval means. 



8.6.5 Discussion Questions 

Exercise 8.6.14 

Using the same mean, standard deviation and level of confidence, suppose that n were 69 instead 
of 25. Would the error bound become larger or smaller? How do you know? 

Exercise 8.6.15 

Using the same mean, standard deviation and sample size, how would the error bound change if 
the confidence level were reduced to 90%? Why? 

8.7 Practice 2: Confidence Intervals for Averages, Unknown Population Standard Deviation 8 

8.7.1 Student Learning Outcomes 

• The student will calculate confidence intervals for means when the population standard deviation is 
unknown. 



8.7.2 Given 

The following real data are the result of a random survey of 39 national flags (with replacement between 
picks) from various countries. We are interested in finding a confidence interval for the true mean number 
of colors on a national flag. Let X = the number of colors on a national flag. 



X      Freq. 
1      1 
2      7 
3      18 
4      7 
5      6 

Table 8.1 



8.7.3 Calculating the Confidence Interval 

Exercise 8.7.1 (Solution on p. 223.) 
Calculate the following: 

a. $\bar{x}$ = 

b. $s_x$ = 

c. n = 

Exercise 8.7.2 (Solution on p. 223.) 

Define the Random Variable, $\bar{X}$, in words. $\bar{X}$ = 

Exercise 8.7.3 (Solution on p. 223.) 

What is $\bar{x}$ estimating? 

Exercise 8.7.4 (Solution on p. 223.) 

Is $\sigma_x$ known? 

Exercise 8.7.5 (Solution on p. 223.) 

As a result of your answer to (4), state the exact distribution to use when calculating the Confi- 
dence Interval. 

8 This content is available online at <http://cnx.org/content/m16971/1.14/>. 

8.7.4 Confidence Interval for the True Mean Number 

Construct a 95% Confidence Interval for the true mean number of colors on national flags. 

Exercise 8.7.6 (Solution on p. 223.) 

How much area is in both tails (combined)? $\alpha$ = 

Exercise 8.7.7 (Solution on p. 223.) 

How much area is in each tail? $\frac{\alpha}{2}$ = 

Exercise 8.7.8 (Solution on p. 223.) 

Calculate the following: 

a. lower limit = 

b. upper limit = 

c. error bound = 

Exercise 8.7.9 (Solution on p. 224.) 

The 95% Confidence Interval is: 

Exercise 8.7.10 

Fill in the blanks on the graph with the areas, upper and lower limits of the Confidence Interval 
and the sample mean. 

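Figure 8.3 (graph omitted: curve with blanks for $\frac{\alpha}{2}$ in each tail, CL in the center, and $\bar{X}$ on the horizontal axis) 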



Exercise 8.7.11 

In one complete sentence, explain what the interval means. 



8.7.5 Discussion Questions 

Exercise 8.7.12 

Using the same $\bar{x}$, $s_x$, and level of confidence, suppose that n were 69 instead of 39. Would the 
error bound become larger or smaller? How do you know? 

Exercise 8.7.13 

Using the same $\bar{x}$, $s_x$, and n = 39, how would the error bound change if the confidence level were 
reduced to 90%? Why? 

8.8 Practice 3: Confidence Intervals for Proportions 9 

8.8.1 Student Learning Outcomes 

• The student will calculate confidence intervals for proportions. 



8.8.2 Given 

The Ice Chalet offers dozens of different beginning ice-skating classes. All of the class names are put into a 
bucket. The 5 P.M., Monday night, ages 8 - 12, beginning ice-skating class was picked. In that class were 64 
girls and 16 boys. Suppose that we are interested in the true proportion of girls, ages 8 - 12, in all beginning 
ice-skating classes at the Ice Chalet. Assume that the children in the selected class are a random sample of 
the population. 

8.8.3 Estimated Distribution 

Exercise 8.8.1 

What is being counted? 

Exercise 8.8.2 (Solution on p. 224.) 

In words, define the Random Variable X. X = 

Exercise 8.8.3 (Solution on p. 224.) 

Calculate the following: 

a. x = 

b. n = 

c. p' = 

Exercise 8.8.4 (Solution on p. 224.) 

State the estimated distribution of X. X ~ 

Exercise 8.8.5 (Solution on p. 224.) 

Define a new Random Variable P' . What is p' estimating? 

Exercise 8.8.6 (Solution on p. 224.) 

In words, define the Random Variable P'. P' = 

Exercise 8.8.7 

State the estimated distribution of P'. P' ~ 



8.8.4 Explaining the Confidence Interval 

Construct a 92% Confidence Interval for the true proportion of girls in the age 8-12 beginning ice-skating 
classes at the Ice Chalet. 

Exercise 8.8.8 (Solution on p. 224.) 

How much area is in both tails (combined)? $\alpha$ = 

Exercise 8.8.9 (Solution on p. 224.) 

How much area is in each tail? $\frac{\alpha}{2}$ = 

Exercise 8.8.10 (Solution on p. 224.) 

Calculate the following: 

a. lower limit = 

b. upper limit = 

c. error bound = 

Exercise 8.8.11 (Solution on p. 224.) 

The 92% Confidence Interval is: 

Exercise 8.8.12 

Fill in the blanks on the graph with the areas, upper and lower limits of the Confidence Interval, and 
the sample proportion. 

Figure 8.4 (graph omitted: curve with blanks for $\frac{\alpha}{2}$ in each tail, C.L. in the center, and P' on the horizontal axis) 

9 This content is available online at <http://cnx.org/content/m16968/1.13/>. 



Exercise 8.8.13 

In one complete sentence, explain what the interval means. 



8.8.5 Discussion Questions 

Exercise 8.8.14 

Using the same p' and level of confidence, suppose that n were increased to 100. Would the error 
bound become larger or smaller? How do you know? 

Exercise 8.8.15 

Using the same p' and n = 80, how would the error bound change if the confidence level were 
increased to 98%? Why? 

Exercise 8.8.16 

If you decreased the allowable error bound, why would the minimum sample size increase (keep- 
ing the same level of confidence)? 

8.9 Homework Link 10 

Link to homework questions in Homework Collection/Book for Confidence Intervals Chapter 8 of Collab- 
orative Statistics for R. Bloom 

Chapter 8 Homework Problems ( http://cnx.org/content/m16966/latest/ ) 11 



10 This content is available online at <http://cnx.org/content/m19042/1.1/>. 
11 http://cnx.org/content/m16966/latest/ 

8.10 Review Questions Link 12 

Link to Review Questions in Homework Collection/Book for Confidence Intervals Chapter 8 of Collabora- 
tive Statistics for R. Bloom 

Chapter 8 Review Questions ( http://cnx.org/content/m19018/latest/ ) 13 



12 This content is available online at <http://cnx.org/content/m19037/1.1/>. 
13 http://cnx.org/content/m19018/latest/ 

Solutions to Exercises in Chapter 8 

Solutions to Practice 1: Confidence Intervals for Averages, Known Population Stan- 
dard Deviation 

Solution to Exercise 8.6.1 (p. 215) 
30.4 

Solution to Exercise 8.6.2 (p. 215) 
25 

Solution to Exercise 8.6.3 (p. 215) 
$\sigma$ 

Solution to Exercise 8.6.4 (p. 215) 

the mean age of 25 randomly selected Winter Foothill students 

Solution to Exercise 8.6.5 (p. 215) 

$\mu$ 

Solution to Exercise 8.6.6 (p. 215) 

yes 

Solution to Exercise 8.6.7 (p. 215) 

Normal 

Solution to Exercise 8.6.8 (p. 215) 

0.05 

Solution to Exercise 8.6.9 (p. 215) 

0.025 

Solution to Exercise 8.6.10 (p. 215) 

a. 24.52 

b. 36.28 

c. 5.88 

Solution to Exercise 8.6.11 (p. 216) 

(24.52,36.28) 
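
The Practice 1 answers above can be reproduced with a short Python sketch of the known-standard-deviation case. The function name z_interval is hypothetical; the inputs are the values given in the practice (sample mean 30.4, population standard deviation 15, n = 25).

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence_level):
    """Return (lower, upper, EBM) for a mean when sigma is known."""
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z-score with alpha/2 to its right
    ebm = z * sigma / math.sqrt(n)            # EBM = z * sigma / sqrt(n)
    return x_bar - ebm, x_bar + ebm, ebm

lower, upper, ebm = z_interval(x_bar=30.4, sigma=15, n=25, confidence_level=0.95)
print(f"EBM = {ebm:.2f}; 95% CI = ({lower:.2f}, {upper:.2f})")

# Discussion question 8.6.14: same inputs, but n = 69 instead of 25.
_, _, ebm_69 = z_interval(x_bar=30.4, sigma=15, n=69, confidence_level=0.95)
print(f"EBM with n = 69: {ebm_69:.2f}")
```

Because n appears under a square root in the denominator, the second error bound comes out smaller than the first.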

Solutions to Practice 2: Confidence Intervals for Averages, Unknown Population Stan- 
dard Deviation 

Solution to Exercise 8.7.1 (p. 217) 

a. 3.26 

b. 1.02 

c. 39 

Solution to Exercise 8.7.2 (p. 217) 
the mean number of colors of 39 flags 

Solution to Exercise 8.7.3 (p. 217) 

$\mu$ 

Solution to Exercise 8.7.4 (p. 217) 

No 

Solution to Exercise 8.7.5 (p. 217) 

$t_{38}$ 

Solution to Exercise 8.7.6 (p. 218) 
0.05 

Solution to Exercise 8.7.7 (p. 218) 
0.025 

Solution to Exercise 8.7.8 (p. 218) 

a. 2.93 

b. 3.59 

c. 0.33 

Solution to Exercise 8.7.9 (p. 218) 

(2.93, 3.59) 
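
The Practice 2 answers can be checked from the frequency table with a short Python sketch. The function name summarize is hypothetical, and the t-score 2.024 is an assumed table value for 95% confidence with df = 38 (what invT(.975,38) returns to three decimals); it is not computed here.

```python
import math

# Table 8.1: number of colors on a flag -> frequency
flag_colors = {1: 1, 2: 7, 3: 18, 4: 7, 5: 6}

def summarize(freq):
    """Return (n, sample mean, sample standard deviation) from a frequency table."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    squared_dev = sum(f * (x - mean) ** 2 for x, f in freq.items())
    s = math.sqrt(squared_dev / (n - 1))   # divide by n - 1 for a sample
    return n, mean, s

n, x_bar, s = summarize(flag_colors)
t_score = 2.024                            # assumed: t-score for df = 38, 95% confidence
ebm = t_score * s / math.sqrt(n)
lower, upper = x_bar - ebm, x_bar + ebm
print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
print(f"EBM = {ebm:.2f}; 95% CI = ({lower:.2f}, {upper:.2f})")
```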

Solutions to Practice 3: Confidence Intervals for Proportions 

Solution to Exercise 8.8.2 (p. 219) 

The number of girls, age 8-12, in the beginning ice skating class 

Solution to Exercise 8.8.3 (p. 219) 

a. 64 

b. 80 

c. 0.8 

Solution to Exercise 8.8.4 (p. 219) 
B(80, 0.80) 

Solution to Exercise 8.8.5 (p. 219) 

p 

Solution to Exercise 8.8.6 (p. 219) 

The proportion of girls, age 8-12, in the beginning ice skating class. 

Solution to Exercise 8.8.8 (p. 219) 

1 - 0.92 = 0.08 

Solution to Exercise 8.8.9 (p. 219) 

0.04 

Solution to Exercise 8.8.10 (p. 219) 

a. 0.72 

b. 0.88 

c. 0.08 

Solution to Exercise 8.8.11 (p. 220) 

(0.72; 0.88) 



Chapter 9 

Hypothesis Testing: Single Mean and 
Single Proportion 

9.1 Hypothesis Testing: Single Mean and Single Proportion 1 

9.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Differentiate between Type I and Type II Errors 

• Describe hypothesis testing in general and in practice 

• Conduct and interpret hypothesis tests for a single population mean, population standard deviation 
known. 

• Conduct and interpret hypothesis tests for a single population mean, population standard deviation 
unknown. 

• Conduct and interpret hypothesis tests for a single population proportion. 

9.1.2 Introduction 

One job of a statistician is to make statistical inferences about populations based on samples taken from the 
population. Confidence intervals are one way to estimate a population parameter. Another way to make 
a statistical inference is to make a decision about a parameter. For instance, a car dealer advertises that 
its new small truck gets 35 miles per gallon, on the average. A tutoring service claims that its method of 
tutoring helps 90% of its students get an A or a B. A company says that women managers in their company 
earn an average of $60,000 per year. 

A statistician will make a decision about these claims. This process is called "hypothesis testing." A hy- 
pothesis test involves collecting data from a sample and evaluating the data. Then, the statistician makes a 
decision as to whether or not there is sufficient evidence based upon analyses of the data, to reject the null 
hypothesis. 

In this chapter, you will conduct hypothesis tests on single means and single proportions. You will also 
learn about the errors associated with these tests. 

Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the data, 
and a conclusion. To perform a hypothesis test, a statistician will: 



1 This content is available online at <http://cnx.org/content/m16997/1.11/>. 

1. Set up two contradictory hypotheses. 

2. Collect sample data (in homework problems, the data or summary statistics will be given to you). 

3. Determine the correct distribution to perform the hypothesis test. 

4. Analyze sample data by performing the calculations that ultimately will allow you to reject or fail to 
reject the null hypothesis. 

5. Make a decision and write a meaningful conclusion. 

NOTE: To do the hypothesis test homework problems for this chapter and later chapters, make 
copies of the appropriate special solution sheets. See the Table of Contents topic "Solution Sheets". 



9.2 Null and Alternate Hypotheses 2 

The actual test begins by considering two hypotheses. They are called the null hypothesis and the alternate 
hypothesis. These hypotheses contain opposing viewpoints. 

$H_0$: The null hypothesis: It is a statement about the population that will be assumed to be true unless it 
can be shown to be incorrect beyond a reasonable doubt. 

$H_a$: The alternate hypothesis: It is a claim about the population that is contradictory to $H_0$ and what we 
conclude when we reject $H_0$. 

Example 9.1 

$H_0$: No more than 30% of the registered voters in Santa Clara County voted in the primary election. 

$H_a$: More than 30% of the registered voters in Santa Clara County voted in the primary election. 

Example 9.2 

We want to test whether the mean grade point average in American colleges is different from 2.0 
(out of 4.0). 

$H_0: \mu = 2.0$   $H_a: \mu \neq 2.0$ 

Example 9.3 

We want to test if college students take less than five years to graduate from college, on the aver- 
age. 

$H_0: \mu \geq 5$   $H_a: \mu < 5$ 

Example 9.4 

In an issue of U. S. News and World Report, an article on school standards stated that about half 
of all students in France, Germany, and Israel take advanced placement exams and a third pass. 
The same article stated that 6.6% of U. S. students take advanced placement exams and 4.4% pass. 
Test if the percentage of U. S. students who take advanced placement exams is more than 6.6%. 

$H_0: p = 0.066$   $H_a: p > 0.066$ 

Since the null and alternate hypotheses are contradictory, you must examine evidence to decide if you have 
enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data. 

After you have determined which hypothesis the sample supports, you make a decision. There are two 
options for a decision. They are "reject $H_0$" if the sample information favors the alternate hypothesis or "do 
not reject $H_0$" or "fail to reject $H_0$" if the sample information is insufficient to reject the null hypothesis. 



2 This content is available online at <http://cnx.org/content/m16998/1.14/>. 

Mathematical Symbols Used in $H_0$ and $H_a$: 

$H_0$                                 $H_a$ 
equal (=)                             not equal ($\neq$) or greater than (>) or less than (<) 
greater than or equal to ($\geq$)     less than (<) 
less than or equal to ($\leq$)        more than (>) 

Table 9.1 



NOTE: $H_0$ always has a symbol with an equal in it. $H_a$ never has a symbol with an equal in it. The 
choice of symbol depends on the wording of the hypothesis test. However, be aware that many 
researchers (including one of the co-authors in research work) use = in the Null Hypothesis, even 
with > or < as the symbol in the Alternate Hypothesis. This practice is acceptable because we 
only make the decision to reject or not reject the Null Hypothesis. 



9.2.1 Optional Collaborative Classroom Activity 

Bring to class a newspaper, some news magazines, and some Internet articles. In groups, find articles from 
which your group can write null and alternate hypotheses. Discuss your hypotheses with the rest of the 
class. 

9.3 Outcomes and the Type I and Type II Errors 3 

When you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or 
falseness) of the null hypothesis $H_0$ and the decision to reject or not. The outcomes are summarized in the 
following table: 

ACTION                   $H_0$ IS ACTUALLY 
                         True                False 
Do not reject $H_0$      Correct Outcome     Type II error 
Reject $H_0$             Type I Error        Correct Outcome 

Table 9.2 



The four possible outcomes in the table are: 

• The decision is to not reject $H_0$ when, in fact, $H_0$ is true (correct decision). 

• The decision is to reject $H_0$ when, in fact, $H_0$ is true (incorrect decision known as a Type I error). 

• The decision is to not reject $H_0$ when, in fact, $H_0$ is false (incorrect decision known as a Type II error). 

• The decision is to reject $H_0$ when, in fact, $H_0$ is false (correct decision whose probability is called the 
Power of the Test). 

Each of the errors occurs with a particular probability. The Greek letters $\alpha$ and $\beta$ represent the probabilities. 

$\alpha$ = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the 
null hypothesis is true. 

$\beta$ = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when 
the null hypothesis is false. 

3 This content is available online at <http://cnx.org/content/m17006/1.8/>. 

$\alpha$ and $\beta$ should be as small as possible because they are probabilities of errors. They are rarely 0. 

The Power of the Test is $1 - \beta$. Ideally, we want a high power that is as close to 1 as possible. Increasing the 
sample size can increase the Power of the Test. 
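
The effect of sample size on power can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration only: the test is taken to be right-tailed with a known population standard deviation, the function name power_right_tailed_z is hypothetical, and the numbers (null mean 35, true mean 36, standard deviation 3, α = 0.05) are invented rather than taken from the text.

```python
import math
from statistics import NormalDist

def power_right_tailed_z(mu_0, mu_true, sigma, n, alpha):
    """Power (1 - beta) of a right-tailed z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)             # reject H0 when z > z_crit
    shift = (mu_true - mu_0) * math.sqrt(n) / sigma    # how far the truth moves z
    beta = std_normal.cdf(z_crit - shift)              # P(do not reject | H0 false)
    return 1 - beta

# Hypothetical values, for illustration only.
for n in (25, 50, 100):
    power = power_right_tailed_z(mu_0=35, mu_true=36, sigma=3, n=n, alpha=0.05)
    print(f"n = {n:3d}: power = {power:.3f}")
```

In this sketch the printed power rises as n grows, because a larger sample shrinks the standard deviation of the sample mean and makes a false null hypothesis easier to detect.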

The following are examples of Type I and Type II errors. 

Example 9.5 

Suppose the null hypothesis, $H_0$, is: Frank's rock climbing equipment is safe. 

Type I error: Frank thinks that his rock climbing equipment may not be safe when, in fact, it really 
is safe. Type II error: Frank thinks that his rock climbing equipment may be safe when, in fact, it 
is not safe. 

$\alpha$ = probability that Frank thinks his rock climbing equipment may not be safe when, in fact, it 
really is safe. $\beta$ = probability that Frank thinks his rock climbing equipment may be safe when, in 
fact, it is not safe. 

Notice that, in this case, the error with the greater consequence is the Type II error. (If Frank thinks 
his rock climbing equipment is safe, he will go ahead and use it.) 

Example 9.6 

Suppose the null hypothesis, $H_0$, is: The victim of an automobile accident is alive when he arrives 
at the emergency room of a hospital. 

Type I error: The emergency crew thinks that the victim is dead when, in fact, the victim is alive. 
Type II error: The emergency crew does not know if the victim is alive when, in fact, the victim is 
dead. 

$\alpha$ = probability that the emergency crew thinks the victim is dead when, in fact, he is really alive 
= P(Type I error). $\beta$ = probability that the emergency crew does not know if the victim is alive 
when, in fact, the victim is dead = P(Type II error). 

The error with the greater consequence is the Type I error. (If the emergency crew thinks the victim 
is dead, they will not treat him.) 



9.4 Distribution Needed for Hypothesis Testing 4 

Earlier in the course, we discussed sampling distributions. Particular distributions are associated with 
hypothesis testing. Perform tests of a population mean using a normal distribution or a student's-t dis- 
tribution. (Remember, use a student's-t distribution when the population standard deviation is unknown 
and the distribution of the sample mean is approximately normal.) In this chapter we perform tests of a 
population proportion using a normal distribution (usually n is large or the sample size is large). 

If you are testing a single population mean, the distribution for the test is for means: 

$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X}{\sqrt{n}}\right)$ or $t_{df}$



4 This content is available online at <http://cnx.Org/content/ml7017/l.13/>. 
The population parameter is μ. The estimated value (point estimate) for μ is x̄, the sample mean.

If you are testing a single population proportion, the distribution for the test is for proportions or percent- 
ages: 

$P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$

The population parameter is p. The estimated value (point estimate) for p is p'. p' = x/n where x is the
number of successes and n is the sample size.

9.5 Assumption 5 

When you perform a hypothesis test of a single population mean μ using a Student's-t distribution (often
called a t-test), there are fundamental assumptions that need to be met in order for the test to work prop- 
erly. Your data should be a simple random sample that comes from a population that is approximately 
normally distributed. You use the sample standard deviation to approximate the population standard 
deviation. (Note that if the sample size is sufficiently large, a t-test will work even if the population is not 
approximately normally distributed). 

When you perform a hypothesis test of a single population mean μ using a normal distribution (often
called a z-test), you take a simple random sample from the population. The population you are testing 
is normally distributed or your sample size is sufficiently large. You know the value of the population 
standard deviation. 

When you perform a hypothesis test of a single population proportion p, you take a simple random 
sample from the population. You must meet the conditions for a binomial distribution which are there are 
a certain number n of independent trials, the outcomes of any trial are success or failure, and each trial has 
the same probability of a success p. The shape of the binomial distribution needs to be similar to the shape 
of the normal distribution. To ensure this, the quantities np and nq must both be greater than five (np > 5
and nq > 5). Then the binomial distribution of sample (estimated) proportion can be approximated by the
normal distribution with μ = p and σ = $\sqrt{\frac{pq}{n}}$. Remember that q = 1 − p.
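
The np and nq condition can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function names normal_approx_ok and proportion_std_dev are hypothetical and chosen only for illustration.

```python
from math import sqrt

def normal_approx_ok(n, p):
    """Return True when both np and nq exceed 5 (the rule of thumb in the text)."""
    q = 1 - p
    return n * p > 5 and n * q > 5

def proportion_std_dev(n, p):
    """Standard deviation of the sample proportion P': sqrt(pq/n)."""
    return sqrt(p * (1 - p) / n)

# n = 100 trials with p = 0.5: np = 50 and nq = 50, so the condition holds.
print(normal_approx_ok(100, 0.5))
# n = 20 trials with p = 0.1: np = 2, so the condition fails.
print(normal_approx_ok(20, 0.1))
print(proportion_std_dev(100, 0.5))
```

The check only covers the shape condition. It says nothing about whether the sample is a simple random sample, which still has to be judged from how the data were collected.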

9.6 Rare Events 6 

Suppose you make an assumption about a property of the population (this assumption is the null hypoth- 
esis). Then you gather sample data randomly. If the sample has properties that would be very unlikely 
to occur if the assumption is true, then you would conclude that your assumption about the population is 
probably incorrect. (Remember that your assumption is just an assumption - it is not a fact and it may or 
may not be true. But your sample data are real and the data are showing you a fact that seems to contradict 
your assumption.) 

For example, Didi and Ali are at a birthday party of a very wealthy friend. They hurry to be first in line 
to grab a prize from a tall basket that they cannot see inside because they will be blindfolded. There are 
200 plastic bubbles in the basket and Didi and Ali have been told that there is only one with a $100 bill. 
Didi is the first person to reach into the basket and pull out a bubble. Her bubble contains a $100 bill. The 
probability of this happening is 1/200 = 0.005. Because this is so unlikely, Ali is hoping that what the two
of them were told is wrong and there are more $100 bills in the basket. A "rare event" has occurred (Didi 
getting the $100 bill) so Ali doubts the assumption about only one $100 bill being in the basket. 

5 This content is available online at <http://cnx.Org/content/ml7002/l.16/>. 
6 This content is available online at <http://cnx.Org/content/ml6994/l.8/>. 
9.7 Using the Sample to Support One of the Hypotheses 7

Use the sample data to calculate the actual probability of getting the test result, called the p-value. The
p-value is the probability that, if the null hypothesis is true, the results from another randomly selected
sample will be as extreme as or more extreme than the results obtained from the given sample.

A large p-value calculated from the data indicates that we should fail to reject the null hypothesis. The 
smaller the p-value, the more unlikely the outcome, and the stronger the evidence is against the null hy- 
pothesis. We would reject the null hypothesis if the evidence is strongly against it. 

Draw a graph that shows the p-value. The hypothesis test is easier to perform if you use a graph because 
you see the problem more clearly. 

Example 9.7: (to illustrate the p-value) 

Suppose a baker claims that his bread height is more than 15 cm, on the average. Several of his 
customers do not believe him. To persuade his customers that he is right, the baker decides to do a 
hypothesis test. He bakes 10 loaves of bread. The mean height of the sample loaves is 17 cm. The 
baker knows from baking hundreds of loaves of bread that the standard deviation for the height 
is 0.5 cm. and the distribution of heights is normal. 

The null hypothesis could be H₀: μ ≤ 15. The alternate hypothesis is Hₐ: μ > 15.

The words "is more than" translate as a ">" so "μ > 15" goes into the alternate hypothesis. The
null hypothesis must contradict the alternate hypothesis.



Since σ is known (σ = 0.5 cm), the distribution for the sample mean is known to be normal with
mean μ = 15 and standard deviation $\frac{\sigma}{\sqrt{n}} = \frac{0.5}{\sqrt{10}} = 0.16$.



Suppose the null hypothesis is true (the mean height of the loaves is no more than 15 cm). Then 
is the mean height (17 cm) calculated from the sample unexpectedly large? The hypothesis test 
works by asking the question how unlikely the sample mean would be if the null hypothesis 
were true. The graph shows how far out the sample mean is on the normal curve. The p-value is 
the probability that, if we were to take other samples, any other sample mean would fall at least 
as far out as 17 cm. 

The p-value, then, is the probability that a sample mean is the same or greater than 17 cm. 
when the population mean is, in fact, 15 cm. We can calculate this probability using the normal 
distribution for means from Chapter 7. 

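[Figure: normal curve centered at 15 with the sample mean 17 marked far out in the right tail; the shaded p-value is approximately 0.]

p-value = P(x̄ > 17), which is approximately 0.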

7 This content is available online at <http://cnx.Org/content/ml6995/l.17/>. 






A p-value of approximately 0 tells us that it is highly unlikely that a loaf of bread rises no more
than 15 cm, on the average. That is, almost 0% of all samples of 10 loaves would have a mean height at
least as high as 17 cm purely by CHANCE had the population mean height really been 15 cm. Because the
outcome of 17 cm. is so unlikely (meaning it is happening NOT by chance alone), we conclude 
that the evidence is strongly against the null hypothesis (the mean height is at most 15 cm.). There 
is sufficient evidence that the true mean height for the population of the baker's loaves of bread is 
greater than 15 cm. 
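
The baker's p-value can be reproduced numerically. The sketch below is an illustration that is not part of the original text; it uses Python's math.erfc to get the upper-tail area of the standard normal distribution, and the variable names are chosen only for this example.

```python
from math import sqrt, erfc

mu0 = 15.0     # mean height under the null hypothesis (cm)
sigma = 0.5    # known population standard deviation (cm)
n = 10         # number of loaves in the sample
x_bar = 17.0   # sample mean height (cm)

std_error = sigma / sqrt(n)           # standard deviation of the sample mean
z = (x_bar - mu0) / std_error         # how many standard errors 17 lies above 15
p_value = 0.5 * erfc(z / sqrt(2))     # upper-tail area P(Z > z)

print(round(std_error, 2), round(z, 2), p_value)
```

The sample mean lies more than 12 standard errors above 15, which is why the p-value is indistinguishable from 0 at any practical number of decimal places.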



9.8 Decision and Conclusion 8 

A systematic way to make a decision of whether to reject or not reject the null hypothesis is to compare the
p-value and a preset or preconceived α (also called a "significance level"). A preset α is the probability of
a Type I error (rejecting the null hypothesis when the null hypothesis is true). It may or may not be given
to you at the beginning of the problem.

When you make a decision to reject or not reject H₀, do as follows:

• If α > p-value, reject H₀. The results of the sample data are significant. There is sufficient evidence to
conclude that H₀ is an incorrect belief and that the alternative hypothesis, Hₐ, may be correct.

• If α ≤ p-value, do not reject H₀. The results of the sample data are not significant. There is not
sufficient evidence to conclude that the alternative hypothesis, Hₐ, may be correct.

• When you "do not reject H₀", it does not mean that you should believe that H₀ is true. It simply
means that the sample data have failed to provide sufficient evidence to cast serious doubt about the
truthfulness of H₀.

Conclusion: After you make your decision, write a thoughtful conclusion about the hypotheses in terms 
of the given problem. 
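
The decision rule in this section can be sketched as a small Python function. This is an illustration only: the function name decide is hypothetical, and treating a tie (α equal to the p-value) as "do not reject" is an assumption made for the sketch.

```python
def decide(alpha, p_value):
    """Compare a preset alpha with a p-value and return the decision as text."""
    if alpha > p_value:
        return "reject H0"
    # Assumption for this sketch: a tie is treated as "do not reject".
    return "do not reject H0"

print(decide(0.05, 0.0187))    # alpha is larger than the p-value
print(decide(0.025, 0.1323))   # alpha is smaller than the p-value
```

The function only automates the comparison. Writing the conclusion in terms of the problem still requires judgment.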

9.9 Additional Information 9 

• In a hypothesis test problem, you may see words such as "the level of significance is 1%." The "1%" is
the preconceived or preset α.

• The statistician setting up the hypothesis test selects the value of α to use before collecting the sample
data.

• If no level of significance is given, the accepted standard is to use α = 0.05.

• When you calculate the p-value and draw the picture, the p-value is the area in the left tail, the right
tail, or split evenly between the two tails. For this reason, we call the hypothesis test left, right, or two
tailed.

• The alternate hypothesis, Hₐ, tells you if the test is left, right, or two-tailed. It is the key to conducting
the appropriate test.

• Hₐ never has a symbol that contains an equal sign.

• Thinking about the meaning of the p-value: A data analyst (and anyone else) should have more 
confidence that he made the correct decision to reject the null hypothesis with a smaller p-value (for 
example, 0.001 as opposed to 0.04) even if using the 0.05 level for alpha. Similarly, for a large p-value 
like 0.4, as opposed to a p-value of 0.056 (alpha = 0.05 is less than either number), a data analyst should 
have more confidence that she made the correct decision in failing to reject the null hypothesis. This 
makes the data analyst use judgment rather than mindlessly applying rules. 

8 This content is available online at <http://cnx.Org/content/ml6992/l.ll/>. 
9 This content is available online at <http://cnx.Org/content/ml6999/l.13/>. 
The following examples illustrate a left-tailed, a right-tailed, and a two-tailed test.

Example 9.8 

H₀: μ = 5    Hₐ: μ < 5

Test of a single population mean. Hₐ tells you the test is left-tailed. The picture of the p-value is as
follows:

[Figure: normal curve with the p-value shaded in the left tail.]




Example 9.9 

H₀: p ≤ 0.2    Hₐ: p > 0.2

This is a test of a single population proportion. Hₐ tells you the test is right-tailed. The picture of
the p-value is as follows:

[Figure: normal curve with the p-value shaded in the right tail.]




Example 9.10 

H₀: μ = 50    Hₐ: μ ≠ 50

This is a test of a single population mean. Hₐ tells you the test is two-tailed. The picture of the
p-value is as follows.

[Figure: normal curve centered at 50 on the x̄ axis, with ½(p-value) shaded in each tail.]
9.10 Summary of the Hypothesis Test 10

The hypothesis test itself has an established process. This can be summarized as follows: 

1. Determine H₀ and Hₐ. Remember, they are contradictory.

2. Determine the random variable.

3. Determine the distribution for the test.

4. Draw a graph, calculate the test statistic, and use the test statistic to calculate the p-value. (A z-score
and a t-score are examples of test statistics.)

5. Compare the preconceived α with the p-value, make a decision (reject or do not reject H₀), and write
a clear conclusion using English sentences.

Notice that in performing the hypothesis test, you use α and not β. β is needed to help determine the
sample size of the data that is used in calculating the p-value. Remember that the quantity 1 − β is called
the Power of the Test. A high power is desirable. If the power is too low, statisticians typically increase the
sample size while keeping α the same. If the power is low, the null hypothesis might not be rejected when
it should be.
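
To see how sample size affects power, the following Python sketch computes the power of a left-tailed z-test for two sample sizes. It is an illustration under stated assumptions, not a method given in the text: the null mean 16.43 and standard deviation 0.8 are borrowed from Example 9.11, the "true" mean of 16.0 is hypothetical, and the function name power_left_tailed is invented for this example.

```python
from math import sqrt
from statistics import NormalDist

def power_left_tailed(mu0, mu_true, sigma, n, alpha):
    """Power (1 - beta) of a left-tailed z-test when the true mean is mu_true."""
    std_error = sigma / sqrt(n)
    # Largest sample mean that still leads to rejecting H0 at level alpha.
    critical_mean = mu0 + NormalDist().inv_cdf(alpha) * std_error
    # Probability that the sample mean falls below that cutoff under mu_true.
    return NormalDist(mu_true, std_error).cdf(critical_mean)

power_n15 = power_left_tailed(16.43, 16.0, 0.8, 15, 0.05)
power_n60 = power_left_tailed(16.43, 16.0, 0.8, 60, 0.05)
print(round(power_n15, 3), round(power_n60, 3))
```

With α held at 0.05, the larger sample gives the higher power, which matches the statement that statisticians typically increase the sample size when the power is too low.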



9.11 Examples 11 



Example 9.11 

Jeffrey, as an eight-year old, established a mean time of 16.43 seconds for swimming the 25-yard 
freestyle, with a standard deviation of 0.8 seconds. His dad, Frank, thought that Jeffrey could 
swim the 25-yard freestyle faster by using goggles. Frank bought Jeffrey a new pair of expensive 
goggles and timed Jeffrey for 15 25-yard freestyle swims. For the 15 swims, Jeffrey's mean time 
was 16 seconds. Frank thought that the goggles helped Jeffrey to swim faster than the 16.43 
seconds. Conduct a hypothesis test using a preset α = 0.05. Assume that the swim times for the
25-yard freestyle are normal. 

Solution 

Set up the Hypothesis Test: 

Since the problem is about a mean, this is a test of a single population mean. 

H₀: μ = 16.43    Hₐ: μ < 16.43

For Jeffrey to swim faster, his time will be less than 16.43 seconds. The "<" tells you this is left- 
tailed. 

Determine the distribution needed: 

Random variable: X̄ = the mean time to swim the 25-yard freestyle.

Distribution for the test: X̄ is normal (population standard deviation is known: σ = 0.8)

$\bar{X} \sim N\left(\mu, \frac{\sigma_X}{\sqrt{n}}\right)$. Therefore, $\bar{X} \sim N\left(16.43, \frac{0.8}{\sqrt{15}}\right)$

μ = 16.43 comes from H₀ and not the data. σ = 0.8, and n = 15.

Calculate the p-value using the normal distribution for a mean: 



10 This content is available online at <http://cnx.Org/content/ml6993/l.6/>. 
11 This content is available online at <http://cnx.Org/content/ml7005/l.25/>.
p-value = P(x̄ < 16) = 0.0187 where the sample mean in the problem is given as 16.

p-value = 0.0187 (This is called the actual level of significance.) The p-value is the area to the left
of the sample mean, which is given as 16.

Graph:

[Figure: normal curve centered at μ = 16.43 with the p-value shaded to the left of x̄ = 16.]

Figure 9.1

μ = 16.43 comes from H₀. Our assumption is μ = 16.43.



Interpretation of the p-value: If H is true, there is a 0.0187 probability (1.87%) that Jeffrey's mean 
time to swim the 25-yard freestyle is 16 seconds or less. Because a 1.87% chance is small, the mean 
time of 16 seconds or less is unlikely to have happened randomly. It is a rare event. 

Compare α and the p-value:

α = 0.05    p-value = 0.0187    α > p-value

Make a decision: Since α > p-value, reject H₀.

This means that you reject μ = 16.43. In other words, you do not think Jeffrey swims the 25-yard
freestyle in 16.43 seconds but faster with the new goggles. 

Conclusion: At the 5% significance level, we conclude that Jeffrey swims faster using the new 
goggles. The sample data show there is sufficient evidence that Jeffrey's mean time to swim the 
25-yard freestyle is less than 16.43 seconds. 

The p-value can easily be calculated using the TI-83+ and the TI-84 calculators:

Press STAT and arrow over to TESTS. Press 1:Z-Test. Arrow over to Stats and press ENTER. Arrow
down and enter 16.43 for μ₀ (null hypothesis), .8 for σ, 16 for the sample mean, and 15 for n. Arrow
down to μ: (alternate hypothesis) and arrow over to <μ₀. Press ENTER. Arrow down to Calculate
and press ENTER. The calculator not only calculates the p-value (p = 0.0187) but it also calculates
the test statistic (z-score) for the sample mean. μ < 16.43 is the alternate hypothesis. Do this set
of instructions again except arrow to Draw (instead of Calculate). Press ENTER. A shaded graph
appears with z = −2.08 (test statistic) and p = 0.0187 (p-value). Make sure when you use Draw
that no other equations are highlighted in Y= and the plots are turned off.
When the calculator does a Z-Test, the Z-Test function finds the p-value by doing a normal probability
calculation using the Central Limit Theorem:

P(x̄ < 16) = 2nd DISTR normcdf(-10^99, 16, 16.43, 0.8/√15)
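
For readers without a TI calculator, the same normal probability can be sketched in Python with the standard library's statistics.NormalDist. This is offered as an illustration rather than as part of the original solution; the variable names are chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 16.43    # mean under the null hypothesis (seconds)
sigma = 0.8    # known population standard deviation (seconds)
n = 15         # number of timed swims
x_bar = 16.0   # sample mean (seconds)

std_error = sigma / sqrt(n)
z = (x_bar - mu0) / std_error                     # test statistic
p_value = NormalDist(mu0, std_error).cdf(x_bar)   # left-tail area P(sample mean < 16)

print(round(z, 2), round(p_value, 4))
```

The cdf call plays the role of normcdf with a lower bound of −10^99: both give the area under the normal curve to the left of 16.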

The Type I and Type II errors for this problem are as follows: 

The Type I error is to conclude that Jeffrey swims the 25-yard freestyle, on average, in less than 
16.43 seconds when, in fact, he actually swims the 25-yard freestyle, on average, in 16.43 seconds. 
(Reject the null hypothesis when the null hypothesis is true.) 

The Type II error is that there is not evidence to conclude that Jeffrey swims the 25-yard free-style, 
on average, in less than 16.43 seconds when, in fact, he actually does swim the 25-yard free-style, 
on average, in less than 16.43 seconds. (Do not reject the null hypothesis when the null hypothesis 
is false.) 



Historical Note: The traditional way to compare the two probabilities, α and the p-value, is to compare
the critical value (z-score from α) to the test statistic (z-score from data). The calculated test statistic for the
p-value is −2.08. (From the Central Limit Theorem, the test statistic formula is $z = \frac{\bar{x} - \mu_X}{\sigma_X / \sqrt{n}}$. For this problem,
x̄ = 16, μ_X = 16.43 from the null hypothesis, σ_X = 0.8, and n = 15.) You can find the critical value for
α = 0.05 in the normal table (see 15.Tables in the Table of Contents). The z-score for an area to the left
equal to 0.05 is midway between -1.65 and -1.64 (0.05 is midway between 0.0505 and 0.0495). The z-score is
-1.645. Since −1.645 > −2.08 (which demonstrates that α > p-value), reject H₀. Traditionally, the decision
to reject or not reject was done in this way. Today, comparing the two probabilities α and the p-value is very
common. For this problem, the p-value, 0.0187, is considerably smaller than α, 0.05. You can be confident
about your decision to reject. The graph shows α, the p-value, the test statistic, and the critical value.



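[Figure: normal curve showing the critical value −1.645 with α = 0.05 shaded to its left, and the test statistic −2.08 with p-value = 0.0187 shaded to its left.]

Figure 9.2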
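
The critical-value comparison in the Historical Note can also be sketched in Python. The inverse normal function below replaces the table lookup; this is an illustrative alternative, not the procedure the text describes, and the variable names are chosen for this example.

```python
from statistics import NormalDist

alpha = 0.05
z_critical = NormalDist().inv_cdf(alpha)   # z-score with area alpha to its left
z_test = -2.08                             # test statistic reported in the example

# For a left-tailed test, reject H0 when the test statistic is below the critical value.
reject = z_test < z_critical
print(round(z_critical, 3), reject)
```

Comparing z-scores and comparing α with the p-value always lead to the same decision, because both compare the same two tail areas.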



Example 9.12 

A college football coach thought that his players could bench press a mean weight of 275 pounds. 
It is known that the standard deviation is 55 pounds. Three of his players thought that the mean 
weight was more than that amount. They asked 30 of their teammates for their estimated maxi- 
mum lift on the bench press exercise. The data ranged from 205 pounds to 385 pounds. The actual 
different weights were (frequencies are in parentheses) 205(3); 215(3); 225(1); 241(2); 252(2); 265(2);
275(2); 313(2); 316(5); 338(2); 341(1); 345(2); 368(2); 385(1). (Source: data from Reuben Davis, Kraig 
Evans, and Scott Gunderson.) 

Conduct a hypothesis test using a 2.5% level of significance to determine if the bench press mean 
is more than 275 pounds. 

Solution 

Set up the Hypothesis Test: 

Since the problem is about a mean weight, this is a test of a single population mean. 

H₀: μ = 275    Hₐ: μ > 275    This is a right-tailed test.

Calculating the distribution needed:

Random variable: X̄ = the mean weight, in pounds, lifted by the football players.

Distribution for the test: It is normal because σ is known.

$\bar{X} \sim N\left(275, \frac{55}{\sqrt{30}}\right)$

x̄ = 286.2 pounds (from the data).

σ = 55 pounds (Always use σ if you know it.) We assume μ = 275 pounds unless our data shows
us otherwise.

Calculate the p-value using the normal distribution for a mean and using the sample mean as 
input (see the calculator instructions below for using the data as input): 

p-value = P(x̄ > 286.2) = 0.1323.

Interpretation of the p-value: If H₀ is true, then there is a 0.1323 probability (13.23%) that the
football players can lift a mean weight of 286.2 pounds or more. Because a 13.23% chance is large
enough, a mean weight lift of 286.2 pounds or more is not a rare event.

[Figure: normal curve centered at 275 with the p-value = 0.1323 shaded to the right of x̄ = 286.2.]

Figure 9.3



Compare α and the p-value:
α = 0.025    p-value = 0.1323

Make a decision: Since α < p-value, do not reject H₀.

Conclusion: At the 2.5% level of significance, from the sample data, there is not sufficient evidence 
to conclude that the true mean weight lifted is more than 275 pounds. 

The p-value can easily be calculated using the TI-83+ and the TI-84 calculators: 

Put the data and frequencies into lists. Press STAT and arrow over to TESTS. Press 1:Z-Test. Arrow
over to Data and press ENTER. Arrow down and enter 275 for μ₀, 55 for σ, the name of the list where
you put the data, and the name of the list where you put the frequencies. Arrow down to μ: and
arrow over to >μ₀. Press ENTER. Arrow down to Calculate and press ENTER. The calculator not
only calculates the p-value (p = 0.1331, a little different from the above calculation, in which we
used the sample mean rounded to one decimal place instead of the data) but it also calculates the
test statistic (z-score) for the sample mean, the sample mean, and the sample standard deviation.
μ > 275 is the alternate hypothesis. Do this set of instructions again except arrow to Draw (instead
of Calculate). Press ENTER. A shaded graph appears with z = 1.112 (test statistic) and p = 0.1331
(p-value). Make sure when you use Draw that no other equations are highlighted in Y= and the
plots are turned off.
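
The calculation from the raw data and frequencies can be sketched in Python as well. This is an illustration, not the textbook's method; the list name weights_and_freqs is hypothetical, and the data values are copied from the problem statement.

```python
from math import sqrt
from statistics import NormalDist

# (weight in pounds, frequency) pairs from the problem statement.
weights_and_freqs = [
    (205, 3), (215, 3), (225, 1), (241, 2), (252, 2), (265, 2), (275, 2),
    (313, 2), (316, 5), (338, 2), (341, 1), (345, 2), (368, 2), (385, 1),
]

n = sum(freq for _, freq in weights_and_freqs)
x_bar = sum(weight * freq for weight, freq in weights_and_freqs) / n

mu0 = 275     # mean under the null hypothesis (pounds)
sigma = 55    # known population standard deviation (pounds)

z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
p_value = 1 - NormalDist().cdf(z)       # right-tail area

print(n, round(x_bar, 4), round(z, 3), round(p_value, 4))
```

Because the unrounded sample mean is used here, the result should agree with the calculator's Data option rather than with the value obtained from the rounded mean of 286.2.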



Example 9.13 

Statistics students believe that the mean score on the first statistics test is 65. A statistics instructor 
thinks the mean score is higher than 65. He samples ten statistics students and obtains the scores 
65; 65; 70; 67; 66; 63; 63; 68; 72; 71. He performs a hypothesis test using a 5% level of significance. 
The data are from a normal distribution. 

Solution 

Set up the Hypothesis Test: 

A 5% level of significance means that α = 0.05. This is a test of a single population mean.

H₀: μ = 65    Hₐ: μ > 65

Since the instructor thinks the average score is higher, use a "> ". The "> " means the test is 
right-tailed. 

Determine the distribution needed: 

Random variable: X̄ = average score on the first statistics test.

Distribution for the test: If you read the problem carefully, you will notice that there is no population
standard deviation given. You are only given n = 10 sample data values. Notice also
that the data come from a normal distribution. This means that the distribution for the test is a
student's-t.

Use $t_{df}$. Therefore, the distribution for the test is $t_9$ where n = 10 and df = 10 − 1 = 9.

Calculate the p-value using the Student's-t distribution: 

p-value = P(x̄ > 67) = 0.0396 where the sample mean and sample standard deviation are
calculated as 67 and 3.1972 from the data. 

Interpretation of the p-value: If the null hypothesis is true, then there is a 0.0396 probability 
(3.96%) that the sample mean is 67 or more. 
[Figure: Student's-t curve centered at μ = 65 with the p-value = 0.0396 shaded to the right of x̄ = 67.]

Figure 9.4

Compare α and the p-value:

α = 0.05 and p-value = 0.0396. Therefore, α > p-value.

Make a decision: Since α > p-value, reject H₀.

This means you reject μ = 65. In other words, you believe the average test score is more than 65.

Conclusion: At a 5% level of significance, the sample data show sufficient evidence that the mean 
(average) test score is more than 65, just as the math instructor thinks. 

The p-value can easily be calculated using the TI-83+ and the TI-84 calculators: 

Put the data into a list. Press STAT and arrow over to TESTS. Press 2:T-Test. Arrow over to
Data and press ENTER. Arrow down and enter 65 for μ₀, the name of the list where you put the
data, and 1 for Freq:. Arrow down to μ: and arrow over to >μ₀. Press ENTER. Arrow down
to Calculate and press ENTER. The calculator not only calculates the p-value (p = 0.0396) but it
also calculates the test statistic (t-score) for the sample mean, the sample mean, and the sample
standard deviation. μ > 65 is the alternate hypothesis. Do this set of instructions again except
arrow to Draw (instead of Calculate). Press ENTER. A shaded graph appears with t = 1.9781 (test
statistic) and p = 0.0396 (p-value). Make sure when you use Draw that no other equations are
highlighted in Y= and the plots are turned off.
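
The t-test can also be sketched in Python. The standard library has no Student's-t distribution function, so this illustration integrates the t density numerically with Simpson's rule. That numerical approach is an assumption of the sketch and is not how the calculator is documented to work; the helper names t_pdf and t_upper_tail are hypothetical.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

scores = [65, 65, 70, 67, 66, 63, 63, 68, 72, 71]
mu0 = 65                      # mean under the null hypothesis
n = len(scores)
df = n - 1                    # degrees of freedom

x_bar = mean(scores)
s = stdev(scores)             # sample standard deviation
t_stat = (x_bar - mu0) / (s / sqrt(n))

def t_pdf(x, df):
    """Density of the Student's-t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the area lies to the right of 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3

p_value = t_upper_tail(t_stat, df)
print(x_bar, round(s, 4), round(t_stat, 4), round(p_value, 4))
```

A statistics package with a built-in t distribution would normally be used instead; the integration here only keeps the sketch self-contained.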



Example 9.14 

Joon believes that 50% of first- time brides in the United States are younger than their grooms. 
She performs a hypothesis test to determine if the percentage is the same or different from 50%. 
Joon samples 100 first-time brides and 53 reply that they are younger than their grooms. For the 
hypothesis test, she uses a 1% level of significance. 

Solution 

Set up the Hypothesis Test: 

The 1% level of significance means that α = 0.01. This is a test of a single population proportion.

H₀: p = 0.50    Hₐ: p ≠ 0.50

The words "is the same or different from" tell you this is a two-tailed test. 
Calculate the distribution needed:

Random variable: P' = the percent of first-time brides who are younger than their grooms.

Distribution for the test: The problem contains no mention of a mean. The information is given
in terms of percentages. Use the distribution for P', the estimated proportion.

$P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$. Therefore, $P' \sim N\left(0.5, \sqrt{\frac{(0.5)(0.5)}{100}}\right)$ where p = 0.50, q = 1 − p = 0.50, and
n = 100.

Calculate the p-value using the normal distribution for proportions: 

p-value = P(p'< 0.47 or p' > 0.53 ) = 0.5485 

where x = 53, p' = x/n = 53/100 = 0.53.

Interpretation of the p-value: If the null hypothesis is true, there is 0.5485 probability (54.85%)
that the sample (estimated) proportion p' is 0.53 or more OR 0.47 or less (see the graph below).

[Figure: normal curve centered at p = 0.50 with ½(p-value) = 0.27425 shaded to the left of 0.47 and ½(p-value) = 0.27425 shaded to the right of 0.53.]

Figure 9.5



μ = p = 0.50 comes from H₀, the null hypothesis.

p' = 0.53. Since the curve is symmetrical and the test is two-tailed, the p' for the left tail is equal to
0.50 − 0.03 = 0.47 where μ = p = 0.50. (0.03 is the difference between 0.53 and 0.50.)

Compare α and the p-value:

α = 0.01 and p-value = 0.5485. Therefore, α < p-value.

Make a decision: Since α < p-value, you cannot reject H₀.

Conclusion: At the 1% level of significance, the sample data do not show sufficient evidence that 
the percentage of first-time brides that are younger than their grooms is different from 50%. 

The p-value can easily be calculated using the TI-83+ and the TI-84 calculators: 

Press STAT and arrow over to TESTS. Press 5:1-PropZTest. Enter .5 for p₀, 53 for x and 100 for
n. Arrow down to Prop and arrow to not equals p₀. Press ENTER. Arrow down to Calculate
and press ENTER. The calculator calculates the p-value (p = 0.5485) and the test statistic (z-score).
Prop not equals .5 is the alternate hypothesis. Do this set of instructions again except arrow to
Draw (instead of Calculate). Press ENTER. A shaded graph appears with z = 0.6 (test statistic) and
p = 0.5485 (p-value). Make sure when you use Draw that no other equations are highlighted in
Y= and the plots are turned off.
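
A two-tailed test of a proportion can be sketched in Python in the same way. The following is an illustration under the normal approximation described in this chapter, with variable names chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.50    # proportion under the null hypothesis
x = 53       # number of brides younger than their grooms
n = 100      # sample size

p_prime = x / n
std_dev = sqrt(p0 * (1 - p0) / n)   # standard deviation of P' under H0
z = (p_prime - p0) / std_dev        # test statistic

# Two-tailed test: double the area beyond |z| in one tail.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(z, 2), round(p_value, 4))
```

Doubling the one-tail area is what makes the test two-tailed: it counts sample proportions at least as far from 0.50 in either direction.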

The Type I and Type II errors are as follows: 

The Type I error is to conclude that the proportion of first-time brides that are younger than their 
grooms is different from 50% when, in fact, the proportion is actually 50%. (Reject the null hy- 
pothesis when the null hypothesis is true). 

The Type II error is there is not enough evidence to conclude that the proportion of first time brides 
that are younger than their grooms differs from 50% when, in fact, the proportion does differ from 
50%. (Do not reject the null hypothesis when the null hypothesis is false.) 



Example 9.15 
Problem 1 

Suppose a consumer group suspects that the proportion of households that have three cell phones 
is 30%. A cell phone company has reason to believe that the proportion is not 30%. Before they start
a big advertising campaign, they conduct a hypothesis test. Their marketing people survey 150 
households with the result that 43 of the households have three cell phones. 

Solution 

Set up the Hypothesis Test: 

H₀: p = 0.30    Hₐ: p ≠ 0.30

Determine the distribution needed: 

The random variable is P' = proportion of households that have three cell phones. 



The distribution for the hypothesis test is $P' \sim N\left(0.30, \sqrt{\frac{(0.30)(0.70)}{150}}\right)$

Problem 2 

The value that helps determine the p-value is p' . Calculate p' . 

Problem 3 

What is a success for this problem? 

Problem 4 

What is the level of significance? 

Draw the graph for this problem. Draw the horizontal axis. Label and shade appropriately. 

Problem 5 

Calculate the p-value. 

Problem 6 

Make a decision. (Reject/Do not reject) H₀ because ________.



The next example is a poem written by a statistics student named Nicole Hart. The solution to the problem 
follows the poem. Notice that the hypothesis test is for a single population proportion. This means that the 
null and alternate hypotheses use the parameter p. The distribution for the test is normal. The estimated 
proportion p' is the proportion of fleas killed to the total fleas found on Fido. This is sample information.
The problem gives a preconceived α = 0.01, for comparison, and a 95% confidence interval computation.
The poem is clever and humorous, so please enjoy it! 

NOTE: Hypothesis testing problems consist of multiple steps. To help you do the problems, so- 
lution sheets are provided for your use. Look in the Table of Contents Appendix for the topic 
"Solution Sheets." If you like, use copies of the appropriate solution sheet for homework prob- 
lems. 

Example 9.16 

My dog has so many fleas,
They do not come off with ease.
As for shampoo, I have tried many types
Even one called Bubble Hype,
Which only killed 25% of the fleas,
Unfortunately I was not pleased.

I've used all kinds of soap,
Until I had given up hope
Until one day I saw
An ad that put me in awe.

A shampoo used for dogs
Called GOOD ENOUGH to Clean a Hog
Guaranteed to kill more fleas.

I gave Fido a bath
And after doing the math
His number of fleas
Started dropping by 3's!

Before his shampoo
I counted 42.
At the end of his bath,
I redid the math
And the new shampoo had killed 17 fleas.
So now I was pleased.

Now it is time for you to have some fun 
With the level of significance being .01, 
You must help me figure out 
Use the new shampoo or go without? 

Solution 

Set up the Hypothesis Test: 

H₀: p = 0.25    Hₐ: p > 0.25

Determine the distribution needed: 

In words, CLEARLY state what your random variable X or P' represents. 
P' = The proportion of fleas that are killed by the new shampoo

State the distribution to use for the test.

Normal: $N\left(0.25, \sqrt{\frac{(0.25)(1 - 0.25)}{42}}\right)$



Test Statistic: z = 2.3163 

Calculate the p-value using the normal distribution for proportions: 

p-value =0.0103 

In 1 - 2 complete sentences, explain what the p-value means for this problem. 

If the null hypothesis is true (the proportion is 0.25), then there is a 0.0103 probability that the
sample (estimated) proportion is 0.4048 (17/42) or more.

Use the previous information to sketch a picture of this situation. CLEARLY, label and scale the
horizontal axis and shade the region(s) corresponding to the p-value.

[Figure: normal curve centered at 0.25 with the p-value shaded to the right of 17/42; test statistic for 17/42: z = 2.3163.]

Figure 9.6



Compare α and the p-value:

Indicate the correct decision ("reject" or "do not reject" the null hypothesis), the reason for it, and
write an appropriate conclusion, using COMPLETE SENTENCES.

alpha    decision            reason for decision
0.01     Do not reject H₀    α < p-value

Table 9.3

Conclusion: At the 1% level of significance, the sample data do not show sufficient evidence that 
the percentage of fleas that are killed by the new shampoo is more than 25%. 

Construct a 95% Confidence Interval for the true mean or proportion. Include a sketch of the 
graph of the situation. Label the point estimate and the lower and upper bounds of the Confidence 
Interval. 
[Sketch: confidence interval with lower bound 0.26, point estimate 17/42, and upper bound 0.55 labeled]



Figure 9.7 



Confidence Interval: (0.26,0.55) We are 95% confident that the true population proportion p of 
fleas that are killed by the new shampoo is between 26% and 55%. 

NOTE: This test result is not very definitive since the p-value is very close to alpha. In reality, one 
would probably do more tests by giving the dog another bath after the fleas have had a chance to 
return. 
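
The numbers in this solution can be checked with a few lines of code. The following is a minimal Python sketch, assuming the normal approximation used above; the variable names are illustrative and are not part of the original solution, and the 95% interval is assumed to use the sample proportion in its standard error.

```python
from math import sqrt
from statistics import NormalDist

# Data from the flea-shampoo problem
n = 42        # fleas counted before the bath
x = 17        # fleas killed by the new shampoo
p0 = 0.25     # proportion stated in the null hypothesis

p_prime = x / n                           # sample proportion, about 0.4048
sd_null = sqrt(p0 * (1 - p0) / n)         # standard deviation of P' if H_o is true
z = (p_prime - p0) / sd_null              # test statistic
p_value = 1 - NormalDist().cdf(z)         # right-tailed test (H_a: p > 0.25)

# 95% confidence interval (assumption: standard error based on p')
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * sqrt(p_prime * (1 - p_prime) / n)
ci = (p_prime - margin, p_prime + margin)

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the p-value is so close to 0.01, small rounding differences in intermediate steps could change the decision, which is one reason the note above calls the result not very definitive.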
9.12 Summary of Formulas 12

H_o and H_a are contradictory.

If H_o has:                       then H_a has:
equal (=)                         not equal (≠) or greater than (>) or less than (<)
greater than or equal to (≥)      less than (<)
less than or equal to (≤)         greater than (>)


Table 9.4 

If α ≤ p-value, then do not reject H_o.

If α > p-value, then reject H_o.

α is preconceived. Its value is set before the hypothesis test starts. The p-value is calculated from the data.

α = probability of a Type I error = P(Type I error) = probability of rejecting the null hypothesis when the
null hypothesis is true.

β = probability of a Type II error = P(Type II error) = probability of not rejecting the null hypothesis when
the null hypothesis is false.

If there is no given preconceived α, then use α = 0.05.
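
The decision rule above is mechanical enough to write as a small function. This is only an illustrative sketch; the function name `decide` and its default value are hypothetical choices made here, not something defined in the text.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H_o only when alpha > p-value."""
    if alpha > p_value:
        return "reject H_o"
    return "do not reject H_o"

# Flea-shampoo example: alpha = 0.01 and p-value = 0.0103
print(decide(0.0103, alpha=0.01))
# Same p-value with the default alpha of 0.05
print(decide(0.0103))
```

The default of 0.05 mirrors the convention stated above for problems that give no preconceived α.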
Types of Hypothesis Tests 

• Single population mean, known population variance (or standard deviation): Normal test. 

• Single population mean, unknown population variance (or standard deviation): Student's-t test. 

• Single population proportion: Normal test. 



2 This content is available online at <http://cnx.org/content/ml6996/1.9/>. 
9.13 Practice 1: Single Mean, Known Population Standard Deviation 13

9.13.1 Student Learning Outcomes 

• The student will conduct a hypothesis test of a single mean with known population standard devia- 
tion. 



9.13.2 Given 

Suppose that a recent article stated that the mean time spent in jail by a first-time convicted burglar is 2.5 
years. A study was then done to see if the mean time has increased in the new century. A random sample 
of 26 first-time convicted burglars in a recent year was picked. The mean length of time in jail from the 
survey was 3 years with a standard deviation of 1.8 years. Suppose that it is somehow known that the 
population standard deviation is 1.5. Conduct a hypothesis test to determine if the mean length of jail time 
has increased. The distribution of the population is normal. 

9.13.3 Hypothesis Testing: Single Mean 

Exercise 9.13.1 (Solution on p. 253.) 

Is this a test of means or proportions? 

Exercise 9.13.2 (Solution on p. 253.) 

State the null and alternative hypotheses. 

a. H_o:

b. H_a:

Exercise 9.13.3 (Solution on p. 253.) 

Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 

Exercise 9.13.4 (Solution on p. 253.) 

What symbol represents the Random Variable for this test? 

Exercise 9.13.5 (Solution on p. 253.) 

In words, define the Random Variable for this test. 

Exercise 9.13.6 (Solution on p. 253.) 

Is the population standard deviation known and, if so, what is it? 

Exercise 9.13.7 (Solution on p. 253.) 

Calculate the following: 

a. x̄ =

b. σ =

c. s_x =

d. n =

Exercise 9.13.8 (Solution on p. 253.)

Since both σ and s_x are given, which should be used? In 1-2 complete sentences, explain why.

Exercise 9.13.9 (Solution on p. 253.) 

State the distribution to use for the hypothesis test. 

Exercise 9.13.10 

Sketch a graph of the situation. Label the horizontal axis. Mark the hypothesized mean and the
sample mean, x̄. Shade the area corresponding to the p-value.



3 This content is available online at <http://cnx.Org/content/ml7004/l.ll/>. 






Exercise 9.13.11 (Solution on p. 253.) 

Find the p-value. 

Exercise 9.13.12 (Solution on p. 253.) 

At a pre-conceived α = 0.05, what is your:

a. Decision: 

b. Reason for the decision: 

c. Conclusion (write out in a complete sentence): 



9.13.4 Discussion Questions 

Exercise 9.13.13 

Does it appear that the mean jail time spent for first time convicted burglars has increased? Why 
or why not? 
9.14 Practice 2: Single Mean, Unknown Population Standard Deviation
9.14.1 Student Learning Outcomes 

• The student will conduct a hypothesis test of a single mean with unknown population standard de- 
viation. 



9.14.2 Given 

A random survey of 75 death row inmates revealed that the mean length of time on death row is 17.4 years 
with a standard deviation of 6.3 years. Conduct a hypothesis test to determine if the population mean time 
on death row could likely be 15 years. 



9.14.3 Hypothesis Testing: Single Mean

Exercise 9.14.1 (Solution on p. 253.)

Is this a test of means or proportions?

Exercise 9.14.2 (Solution on p. 253.)

State the null and alternative hypotheses.

a. H_o:

b. H_a:

Exercise 9.14.3 (Solution on p. 253.)

Is this a right-tailed, left-tailed, or two-tailed test? How do you know?

Exercise 9.14.4 (Solution on p. 253.)

What symbol represents the Random Variable for this test?

Exercise 9.14.5 (Solution on p. 253.)

In words, define the Random Variable for this test.

Exercise 9.14.6 (Solution on p. 253.)

Is the population standard deviation known and, if so, what is it?

Exercise 9.14.7 (Solution on p. 253.)

Calculate the following:

a. x̄ =

b. 6.3 =

c. n =

Exercise 9.14.8 (Solution on p. 254.)

Which test should be used? In 1-2 complete sentences, explain why.

Exercise 9.14.9 (Solution on p. 254.)

State the distribution to use for the hypothesis test.

Exercise 9.14.10

Sketch a graph of the situation. Label the horizontal axis. Mark the hypothesized mean and the
sample mean, x̄. Shade the area corresponding to the p-value.



4 This content is available online at <http://cnx.Org/content/ml7016/l.12/>. 
Figure 9.8

Exercise 9.14.11 (Solution on p. 254.)

Find the p-value.

Exercise 9.14.12 (Solution on p. 254.)

At a pre-conceived α = 0.05, what is your:

a. Decision:

b. Reason for the decision:

c. Conclusion (write out in a complete sentence):



9.14.4 Discussion Question 

Does it appear that the mean time on death row could be 15 years? Why or why not? 
9.15 Practice 3: Single Proportion 15
9.15.1 Student Learning Outcomes 

• The student will conduct a hypothesis test of a single population proportion. 



9.15.2 Given 

The National Institute of Mental Health published an article stating that in any one-year pe- 
riod, approximately 9.5 percent of American adults suffer from depression or a depressive illness. 
(http://www.nimh.nih.gov/publicat/depression.cfm) Suppose that in a survey of 100 people in a certain 
town, seven of them suffered from depression or a depressive illness. Conduct a hypothesis test to deter- 
mine if the true proportion of people in that town suffering from depression or a depressive illness is lower 
than the percent in the general adult American population. 



9.15.3 Hypothesis Testing: Single Proportion

Exercise 9.15.1 (Solution on p. 254.)

Is this a test of means or proportions?

Exercise 9.15.2 (Solution on p. 254.)

State the null and alternative hypotheses.

a. H_o:

b. H_a:

Exercise 9.15.3 (Solution on p. 254.)

Is this a right-tailed, left-tailed, or two-tailed test? How do you know?

Exercise 9.15.4 (Solution on p. 254.)

What symbol represents the Random Variable for this test?

Exercise 9.15.5 (Solution on p. 254.)

In words, define the Random Variable for this test.

Exercise 9.15.6 (Solution on p. 254.)

Calculate the following:

a. x =

b. n =

c. p' =

Exercise 9.15.7 (Solution on p. 254.)

Calculate σ_p'. Make sure to show how you set up the formula.

Exercise 9.15.8 (Solution on p. 254.)

State the distribution to use for the hypothesis test.

Exercise 9.15.9

Sketch a graph of the situation. Label the horizontal axis. Mark the hypothesized mean and the
sample proportion, p-hat. Shade the area corresponding to the p-value.



5 This content is available online at <http://cnx.Org/content/ml7003/l.15/>. 
Exercise 9.15.10 (Solution on p. 254.)

Find the p-value.

Exercise 9.15.11 (Solution on p. 254.)

At a pre-conceived α = 0.05, what is your:

a. Decision: 

b. Reason for the decision: 

c. Conclusion (write out in a complete sentence): 



9.15.4 Discussion Question

Exercise 9.15.12

Does it appear that the proportion of people in that town with depression or a depressive illness
is lower than in the general adult American population? Why or why not?
9.16 Homework Link 16

Link to homework questions in Homework Collection /Book for Hypothesis Test Chapter 9 of Collaborative 
Statistics for R. Bloom 

Chapter 9 Homework Problems ( http://cnx.org/content/ml8867/latest/ ) 



16 This content is available online at <http://cnx.Org/content/ml9047/l.l/>. 
17 http://cnx.org/content/ml8867/latest/ 
9.17 Review Questions Link 18

Link to Review Questions in Homework Collection/Book for Hypothesis Test Chapter 9 of Collaborative 
Statistics for R. Bloom 

Chapter 9 Review Questions ( http://cnx.org/content/19025/latest/ ) 19 



18 This content is available online at <http://cnx.Org/content/ml9038/l.l/>. 
19 http://cnx.org/content/ml9025/latest/ 
Solutions to Exercises in Chapter 9

Solutions to Practice 1: Single Mean, Known Population Standard Deviation 

Solution to Exercise 9.13.1 (p. 245) 
Means 
Solution to Exercise 9.13.2 (p. 245) 

a. H_o: μ = 2.5 (or, H_o: μ ≤ 2.5)
b. H_a: μ > 2.5

Solution to Exercise 9.13.3 (p. 245) 

right-tailed 
Solution to Exercise 9.13.4 (p. 245) 

X̄
Solution to Exercise 9.13.5 (p. 245) 

The mean time spent in jail for 26 first time convicted burglars 
Solution to Exercise 9.13.6 (p. 245) 

Yes, 1.5 

Solution to Exercise 9.13.7 (p. 245) 

a. 3 

b. 1.5 

c. 1.8 

d. 26 

Solution to Exercise 9.13.8 (p. 245) 
σ
Solution to Exercise 9.13.9 (p. 245) 

$\bar{X} \sim N\left(2.5,\ \frac{1.5}{\sqrt{26}}\right)$

Solution to Exercise 9.13.11 (p. 246) 

0.0446 

Solution to Exercise 9.13.12 (p. 246) 

a. Reject the null hypothesis 
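
The p-value and decision reported for Practice 1 can be reproduced with the short Python sketch below. It assumes the known population standard deviation of 1.5 is used (not the sample value 1.8) and that the test is right-tailed; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 2.5       # hypothesized mean jail time, in years
x_bar = 3.0     # sample mean
sigma = 1.5     # known population standard deviation
n = 26          # sample size
alpha = 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))     # test statistic
p_value = 1 - NormalDist().cdf(z)         # right-tailed: H_a is mu > 2.5
decision = "reject" if alpha > p_value else "do not reject"

print(f"z = {z:.4f}, p-value = {p_value:.4f}, decision: {decision} H_o")
```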

Solutions to Practice 2: Single Mean, Unknown Population Standard Deviation 

Solution to Exercise 9.14.1 (p. 247) 

averages 

Solution to Exercise 9.14.2 (p. 247) 

a. H_o: μ = 15

b. H_a: μ ≠ 15

Solution to Exercise 9.14.3 (p. 247) 

two-tailed 

Solution to Exercise 9.14.4 (p. 247) 

X̄
Solution to Exercise 9.14.5 (p. 247) 

the mean time spent on death row for the 75 inmates
Solution to Exercise 9.14.6 (p. 247) 

No 

Solution to Exercise 9.14.7 (p. 247) 
a. 17.4

b. s 

c. 75 

Solution to Exercise 9.14.8 (p. 247) 
t-test
Solution to Exercise 9.14.9 (p. 247) 

t_74

Solution to Exercise 9.14.11 (p. 248) 
0.0015 
Solution to Exercise 9.14.12 (p. 248) 

a. Reject the null hypothesis 

Solutions to Practice 3: Single Proportion 

Solution to Exercise 9.15.1 (p. 249) 
Proportions 
Solution to Exercise 9.15.2 (p. 249) 

a. H : p = 0.095 

b. H a : p < 0.095 

Solution to Exercise 9.15.3 (p. 249) 

left-tailed 

Solution to Exercise 9.15.4 (p. 249) 
P' 
Solution to Exercise 9.15.5 (p. 249) 

the proportion of people in that town surveyed suffering from depression or a depressive illness 
Solution to Exercise 9.15.6 (p. 249) 

a. 7 

b. 100 

c. 0.07 

Solution to Exercise 9.15.7 (p. 249) 
0.0293 

Solution to Exercise 9.15.8 (p. 249) 
Normal 

Solution to Exercise 9.15.10 (p. 250) 
0.1969 
Solution to Exercise 9.15.11 (p. 250) 

a. Do not reject the null hypothesis 



Chapter 10 

Hypothesis Testing: Two Means, Paired 
Data, Two Proportions 



10.1 Hypothesis Testing: Two Population Means and Two Population 
Proportions 1 

10.1.1 Student Learning Outcomes 

By the end of this chapter, the student should be able to: 

• Classify hypothesis tests by type. 

• Conduct and interpret hypothesis tests for two population means, population standard deviations 
known. 

• Conduct and interpret hypothesis tests for two population means, population standard deviations 
unknown. 

• Conduct and interpret hypothesis tests for two population proportions. 

• Conduct and interpret hypothesis tests for matched or paired samples. 

10.1.2 Introduction 

Studies often compare two groups. For example, researchers are interested in the effect aspirin has in 
preventing heart attacks. Over the last few years, newspapers and magazines have reported about various 
aspirin studies involving two groups. Typically, one group is given aspirin and the other group is given a 
placebo. Then, the heart attack rate is studied over several years. 

There are other situations that deal with the comparison of two groups. For example, studies compare var- 
ious diet and exercise programs. Politicians compare the proportion of individuals from different income 
brackets who might vote for them. Students are interested in whether SAT or GRE preparatory courses 
really help raise their scores. 

In the previous chapter, you learned to conduct hypothesis tests on single means and single proportions. 
You will expand upon that in this chapter. You will compare two means or two proportions to each other. 
The general procedure is still the same, just expanded. 



1 This content is available online at <http://cnx.org/content/ml7029/1.9/>.
To compare two means or two proportions, you work with two groups. The groups are classified either as
independent or matched pairs. Independent groups mean that the two samples taken are independent, 
that is, sample values selected from one population are not related in any way to sample values selected 
from the other population. Matched pairs consist of two samples that are dependent. The parameter tested 
using matched pairs is the population mean. The parameters tested using independent groups are either 
population means or population proportions. 

NOTE: This chapter relies on either a calculator or a computer to calculate the degrees of freedom, 
the test statistics, and p-values. TI-83+ and TI-84 instructions are included as well as the test statis- 
tic formulas. When using the TI-83+/TI-84 calculators, we do not need to separate two population 
means, independent groups, population variances unknown into large and small sample sizes. 
However, most statistical computer software has the ability to differentiate these tests. 

This chapter deals with the following hypothesis tests: 
Independent groups (samples are independent) 

• Test of two population means. 

• Test of two population proportions. 

Matched or paired samples (samples are dependent) 

• Becomes a test of one population mean. 



10.2 Comparing Two Independent Population Means with Unknown 
Population Standard Deviations 2 

1. The two independent samples are simple random samples from two distinct populations. 

2. Both populations are normally distributed with the population means and standard deviations un- 
known unless the sample sizes are greater than 30. In that case, the populations need not be normally 
distributed. 

NOTE: The test comparing two independent population means with unknown and possibly un- 
equal population standard deviations is called the Aspin-Welch t-test. The degrees of freedom 
formula was developed by Aspin-Welch. 

The comparison of two population means is very common. A difference between the two samples depends
on both the means and the standard deviations. Very different means can occur by chance if there is great
variation among the individual samples. In order to account for the variation, we take the difference of
the sample means, X̄_1 − X̄_2, and divide by the standard error (shown below) in order to standardize the
difference. The result is a t-score test statistic (shown below).

Because we do not know the population standard deviations, we estimate them using the two sample
standard deviations from our independent samples. For the hypothesis test, we calculate the estimated
standard deviation, or standard error, of the difference in sample means, X̄_1 − X̄_2.

The standard error is:

$\sqrt{\frac{(S_1)^2}{n_1} + \frac{(S_2)^2}{n_2}}$ (10.1)

The test statistic (t-score) is calculated as follows: 



2 This content is available online at <http://cnx.Org/content/ml7025/l.18/>. 
$t\text{-score} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(S_1)^2}{n_1} + \frac{(S_2)^2}{n_2}}}$ (10.2)


where: 

• S_1 and S_2, the sample standard deviations, are estimates of σ_1 and σ_2, respectively.

• σ_1 and σ_2 are the unknown population standard deviations.

• x̄_1 and x̄_2 are the sample means. μ_1 and μ_2 are the population means.

The degrees of freedom (df) is a somewhat complicated calculation. However, a computer or calculator
calculates it easily. The dfs are not always a whole number. The test statistic calculated above is approximated
by the student's-t distribution with dfs as follows:

Degrees of freedom

$df = \frac{\left(\frac{(S_1)^2}{n_1} + \frac{(S_2)^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{(S_1)^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{(S_2)^2}{n_2}\right)^2}$ (10.3)

When both sample sizes n_1 and n_2 are five or larger, the student's-t approximation is very good. Notice that
the sample variances (S_1)^2 and (S_2)^2 are not pooled. (If the question comes up, do not pool the variances.)

NOTE: It is not necessary to compute this by hand. A calculator or computer easily computes it. 

Example 10.1: Independent groups 

The average amount of time boys and girls ages 7 through 11 spend playing sports each day is 
believed to be the same. An experiment is done, data is collected, resulting in the table below. 
Both populations have a normal distribution. 





          Sample Size     Average Number of Hours Playing Sports Per Day     Sample Standard Deviation
Girls     9               2 hours                                            √0.75
Boys      16              3.2 hours                                          1.00



Table 10.1 

Problem 

Is there a difference in the mean amount of time boys and girls ages 7 through 11 play sports each
day? Test at the 5% level of significance.

Solution 

The population standard deviations are not known. Let g be the subscript for girls and b be the
subscript for boys. Then, μ_g is the population mean for girls and μ_b is the population mean for
boys. This is a test of two independent groups, two population means.

Random variable: X̄_g − X̄_b = difference in the sample mean amount of time girls and boys play
sports each day.

H_o: μ_g = μ_b     μ_g − μ_b = 0

H_a: μ_g ≠ μ_b     μ_g − μ_b ≠ 0

The words "the same" tell you H_o has an "=". Since there are no other words to indicate H_a, then
assume "is different." This is a two-tailed test.

Distribution for the test: Use t_df where df is calculated using the df formula for independent
groups, two population means. Using a calculator, df is approximately 18.8462. Do not pool the
variances.

Calculate the p-value using a student's-t distribution: p-value = 0.0054 

Graph: 



[Graph: t distribution centered at 0 (from H_o, μ_g − μ_b = 0), horizontal axis X̄_g − X̄_b, with −1.2 and 1.2 marked; each shaded tail has area ½(p-value) = 0.0028]



Figure 10.1 



s_g = √0.75

s_b = 1

So, x̄_g − x̄_b = 2 − 3.2 = −1.2



Half the p-value is below -1.2 and half is above 1.2. 

Make a decision: Since α > p-value, reject H_o.

This means you reject μ_g = μ_b. The means are different.

Conclusion: At the 5% level of significance, the sample data show there is sufficient evidence to
conclude that the mean number of hours that girls and boys aged 7 through 11 play sports per day
is different (mean number of hours boys aged 7 through 11 play sports per day is greater than the
mean number of hours played by girls OR the mean number of hours girls aged 7 through 11 play
sports per day is greater than the mean number of hours played by boys).

NOTE: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press 4:2-SampTTest. Arrow over
to Stats and press ENTER. Arrow down and enter 2 for the first sample mean, √0.75 for Sx1, 9
for n1, 3.2 for the second sample mean, 1 for Sx2, and 16 for n2. Arrow down to μ1: and arrow
to does not equal μ2. Press ENTER. Arrow down to Pooled: and No. Press ENTER. Arrow down to
Calculate and press ENTER. The p-value is p = 0.0054, the dfs are approximately 18.8462, and the
test statistic is -3.14. Do the procedure again but instead of Calculate do Draw.
Example 10.2

A study is done by a community group in two neighboring colleges to determine which one grad- 
uates students with more math classes. College A samples 11 graduates. Their average is 4 math 
classes with a standard deviation of 1.5 math classes. College B samples 9 graduates. Their aver- 
age is 3.5 math classes with a standard deviation of 1 math class. The community group believes 
that a student who graduates from college A has taken more math classes, on the average. Both 
populations have a normal distribution. Test at a 1% significance level. Answer the following 
questions. 

Problem 1 (Solution on p. 275.)

Is this a test of two means or two proportions?

Problem 2 (Solution on p. 275.)

Are the population standard deviations known or unknown?

Problem 3 (Solution on p. 275.)

Which distribution do you use to perform the test?

Problem 4 (Solution on p. 275.)

What is the random variable?

Problem 5 (Solution on p. 275.)

What are the null and alternate hypotheses?

Problem 6 (Solution on p. 275.)

Is this test right, left, or two tailed?

Problem 7 (Solution on p. 275.)

What is the p-value?

Problem 8 (Solution on p. 275.)

Do you reject or not reject the null hypothesis?

Conclusion:

At the 1% level of significance, from the sample data, there is not sufficient evidence to conclude
that a student who graduates from college A has taken more math classes, on the average, than a
student who graduates from college B.



10.3 Comparing Two Independent Population Means with Known Pop- 
ulation Standard Deviations 3 

Even though this situation is not likely (knowing the population standard deviations is not likely), the
following example illustrates hypothesis testing for independent means, known population standard
deviations. The sampling distribution for the difference between the means is normal and both populations
must be normal. The random variable is X̄_1 − X̄_2. The normal distribution has the following format:

Normal distribution

$\bar{X}_1 - \bar{X}_2 \sim N\left[\mu_1 - \mu_2,\ \sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}\right]$ (10.4)



3 This content is available online at <http://cnx.Org/content/ml7042/l.10/>. 
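The standard deviation is:

$\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}$ (10.5)

The test statistic (z-score) is:

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}}}$ (10.6)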



Example 10.3: Independent groups, population standard deviations known

The mean lasting time of 2 competing floor waxes is to be compared. Twenty floors are randomly
assigned to test each wax. Both populations have a normal distribution. The following table is the result.

Wax     Sample Mean Number of Months Floor Wax Lasts     Population Standard Deviation
1       3                                                0.33
2       2.9                                              0.36

Table 10.2 

Problem 

Does the data indicate that wax 1 is more effective than wax 2? Test at a 5% level of significance. 

Solution 

This is a test of two independent groups, two population means, population standard deviations 
known. 

Random Variable: X̄_1 − X̄_2 = difference in the mean number of months the competing floor waxes
last.

H_o: μ_1 ≤ μ_2

H_a: μ_1 > μ_2

The words "is more effective" say that wax 1 lasts longer than wax 2, on the average. "Longer"
is a ">" symbol and goes into H_a. Therefore, this is a right-tailed test.

Distribution for the test: The population standard deviations are known so the distribution is
normal. Using the formula above, the distribution is:

$\bar{X}_1 - \bar{X}_2 \sim N\left(0,\ \sqrt{\frac{0.33^2}{20} + \frac{0.36^2}{20}}\right)$

Since μ_1 ≤ μ_2 then μ_1 − μ_2 ≤ 0 and the mean for the normal distribution is 0.

Calculate the p-value using the normal distribution: p-value = 0.1799

Graph:
Graph: 
[Graph: normal curve centered at 0 (from H_o, μ_1 − μ_2 ≤ 0), horizontal axis X̄_1 − X̄_2, with the area to the right of 0.1 shaded; p-value = 0.1799]

Figure 10.2 

x̄_1 − x̄_2 = 3 − 2.9 = 0.1

Compare α and the p-value: α = 0.05 and p-value = 0.1799. Therefore, α < p-value.

Make a decision: Since α < p-value, do not reject H_o.

Conclusion: At the 5% level of significance, from the sample data, there is not sufficient evidence 
to conclude that the mean time wax 1 lasts is longer (wax 1 is more effective) than the mean time 
wax 2 lasts. 

NOTE: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press 3:2-SampZTest. Arrow over
to Stats and press ENTER. Arrow down and enter .33 for sigma1, .36 for sigma2, 3 for the first
sample mean, 20 for n1, 2.9 for the second sample mean, and 20 for n2. Arrow down to μ1: and
arrow to > μ2. Press ENTER. Arrow down to Calculate and press ENTER. The p-value is p = 0.1799
and the test statistic is 0.9157. Do the procedure again but instead of Calculate do Draw.
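
The floor wax calculation can also be checked in Python. The sketch below is illustrative and assumes the right-tailed setup of Example 10.3; the variable names are chosen here and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist

# Values from Table 10.2
mean_1, sigma_1, n_1 = 3.0, 0.33, 20    # wax 1
mean_2, sigma_2, n_2 = 2.9, 0.36, 20    # wax 2

# Standard deviation of the difference in sample means
sd_diff = sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)

# Under H_o the difference in population means is taken to be 0
z = ((mean_1 - mean_2) - 0) / sd_diff
p_value = 1 - NormalDist().cdf(z)       # right-tailed: H_a is mu_1 > mu_2

print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```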



10.4 Comparing Two Independent Population Proportions 4 

1. The two independent samples are simple random samples that are independent. 

2. The number of successes is at least five and the number of failures is at least five for each of the 
samples. 

Comparing two proportions, like comparing two means, is common. If two estimated proportions are
different, it may be due to a difference in the populations or it may be due to chance. A hypothesis test can
help determine if a difference in the estimated proportions (P'_A − P'_B) reflects a difference in the population
proportions.

The difference of two proportions follows an approximate normal distribution. Generally, the null
hypothesis states that the two proportions are the same. That is, H_o: p_A = p_B. To conduct the test, we use a pooled
proportion, p_c.



4 This content is available online at <http://cnx.Org/content/ml7043/l.12/>. 
The pooled proportion is calculated as follows:

$p_c = \frac{x_A + x_B}{n_A + n_B}$ (10.7)

The distribution for the differences is:

$P'_A - P'_B \sim N\left[0,\ \sqrt{p_c (1 - p_c) \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}\right]$ (10.8)

The test statistic (z-score) is:

$z = \frac{(p'_A - p'_B) - (p_A - p_B)}{\sqrt{p_c (1 - p_c) \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$ (10.9)



Example 10.4: Two population proportions 

Two types of medication for hives are being tested to determine if there is a difference in the 
proportions of adult patient reactions. Twenty out of a random sample of 200 adults given med- 
ication A still had hives 30 minutes after taking the medication. Twelve out of another random 
sample of 200 adults given medication B still had hives 30 minutes after taking the medication. 
Test at a 1% level of significance. 

10.4.1 Determining the solution

This is a test of 2 population proportions.

Problem (Solution on p. 275.)

How do you know?

Let A and B be the subscripts for medication A and medication B. Then p_A and p_B are the desired
population proportions.

Random Variable:

P'_A − P'_B = difference in the proportions of adult patients who did not react after 30 minutes to
medication A and medication B.

H_o: p_A = p_B     p_A − p_B = 0

H_a: p_A ≠ p_B     p_A − p_B ≠ 0

The words "is a difference" tell you the test is two-tailed.

Distribution for the test: Since this is a test of two binomial population proportions, the
distribution is normal:

$p_c = \frac{x_A + x_B}{n_A + n_B} = \frac{20 + 12}{200 + 200} = 0.08 \qquad 1 - p_c = 0.92$

Therefore, $P'_A - P'_B \sim N\left[0,\ \sqrt{(0.08)(0.92)\left(\frac{1}{200} + \frac{1}{200}\right)}\right]$

P'_A − P'_B follows an approximate normal distribution.

Calculate the p-value using the normal distribution: p-value = 0.1404.
Estimated proportion for group A: p'_A = x_A / n_A = 20/200 = 0.1
Estimated proportion for group B: p'_B = x_B / n_B = 12/200 = 0.06
Graph: 



[Graph: normal curve centered at 0 (from H_o, p_A − p_B = 0); each shaded tail has area ½(p-value) = 0.0702]
Figure 10.3 



p'_A − p'_B = 0.1 − 0.06 = 0.04.

Half the p-value is below -0.04 and half is above 0.04.

Compare α and the p-value: α = 0.01 and the p-value = 0.1404. α < p-value.

Make a decision: Since α < p-value, do not reject H_o.

Conclusion: At a 1% level of significance, from the sample data, there is not sufficient evidence to 
conclude that there is a difference in the proportions of adult patients who did not react after 30 
minutes to medication A and medication B. 

NOTE: TI-83+ and TI-84: Press STAT. Arrow over to TESTS and press 6:2-PropZTest. Arrow down
and enter 20 for x1, 200 for n1, 12 for x2, and 200 for n2. Arrow down to p1: and arrow to not
equal p2. Press ENTER. Arrow down to Calculate and press ENTER. The p-value is p = 0.1404
and the test statistic is 1.47. Do the procedure again but instead of Calculate do Draw.
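
As a cross-check on the calculator output, the pooled two-proportion test of Example 10.4 can be sketched in Python as follows. The variable names are illustrative assumptions; the formulas are the pooled proportion and z-score given earlier in this section.

```python
from math import sqrt
from statistics import NormalDist

x_a, n_a = 20, 200     # medication A: patients who still had hives, sample size
x_b, n_b = 12, 200     # medication B: patients who still had hives, sample size

p_a = x_a / n_a                                  # estimated proportion, group A
p_b = x_b / n_b                                  # estimated proportion, group B
p_c = (x_a + x_b) / (n_a + n_b)                  # pooled proportion

se = sqrt(p_c * (1 - p_c) * (1 / n_a + 1 / n_b)) # standard deviation under H_o
z = ((p_a - p_b) - 0) / se                       # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed test

print(f"p_c = {p_c:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```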



10.5 Matched or Paired Samples 5 

1. Simple random sampling is used. 

2. Sample sizes are often small. 

3. Two measurements (samples) are drawn from the same pair of individuals or objects. 

4. Differences are calculated from the matched or paired samples. 

5. The differences form the sample that is used for the hypothesis test. 



5 This content is available online at <http://cnx.Org/content/ml7033/l.15/>. 
6. The matched pairs have differences that either come from a population that is normal or the number of
differences is sufficiently large so the distribution of the sample mean of differences is approximately 
normal. 

In a hypothesis test for matched or paired samples, subjects are matched in pairs and differences are
calculated. The differences are the data. The population mean for the differences, μ_d, is then tested using
a Student-t test for a single population mean with n − 1 degrees of freedom where n is the number of
differences.

The test statistic (t-score) is:

$t = \frac{\bar{x}_d - \mu_d}{\left(\frac{s_d}{\sqrt{n}}\right)}$ (10.10)



Example 10.5: Matched or paired samples 

A study was conducted to investigate the effectiveness of hypnotism in reducing pain. Results 
for randomly selected subjects are shown in the table. The "before" value is matched to an "after" 
value and the differences are calculated. The differences have a normal distribution. 



Subject:    A      B      C      D       E       F      G      H
Before      6.6    6.5    9.0    10.3    11.3    8.1    6.3    11.6
After       6.8    2.4    7.4    8.5     8.1     6.1    3.4    2.0



Table 10.3 

Problem 

Are the sensory measurements, on average, lower after hypnotism? Test at a 5% significance level. 

Solution 

Corresponding "before" and "after" values form matched pairs. (Calculate "after" - "before").

After Data     Before Data     Difference
6.8            6.6             0.2
2.4            6.5             -4.1
7.4            9               -1.6
8.5            10.3            -1.8
8.1            11.3            -3.2
6.1            8.1             -2
3.4            6.3             -2.9
2              11.6            -9.6



Table 10.4 

The data for the test are the differences: {0.2, -4.1, -1.6, -1.8, -3.2, -2, -2.9, -9.6} 

The sample mean and sample standard deviation of the differences are: 1Q = —3.13 and 

s^ = 2.91 Verify these values. 

Let \i& be the population mean for the differences. We use the subscript d to denote "differences." 
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Random Variable: X d = the mean difference of the sensory measurements 

H : }i d > (10.11) 

There is no improvement, (ji d is the population mean of the differences.) 

H a :}i d <0 (10.12) 

There is improvement. The score should be lower after hypnotism so the difference ought to be 
negative to indicate improvement. 

Distribution for the test: The distribution is a Student-t with df = n - 1 = 8 - 1 = 7. Use $t_7$.
(Notice that the test is for a single population mean.)

Calculate the p-value using the Student-t distribution: p-value = 0.0095 

Graph:

[Figure 10.4: Student-t curve centered at $\mu_d = 0$ (from $H_0$), with $\bar{x}_d = -3.13$ marked on
the horizontal axis; p-value = 0.0095.]

$\bar{X}_d$ is the random variable for the differences.

The sample mean and sample standard deviation of the differences are:

$\bar{x}_d = -3.13$

$s_d = 2.91$

Compare $\alpha$ and the p-value: $\alpha = 0.05$ and p-value = 0.0095. $\alpha$ > p-value.

Make a decision: Since $\alpha$ > p-value, reject $H_0$.

This means that $\mu_d < 0$ and there is improvement.

Conclusion: At a 5% level of significance, from the sample data, there is sufficient evidence to con- 
clude that the sensory measurements, on average, are lower after hypnotism. Hypnotism appears 
to be effective in reducing pain. 

NOTE: For the TI-83+ and TI-84 calculators, you can either calculate the differences ahead of time
(after - before) and put the differences into a list, or you can put the after data into a first list and
the before data into a second list. Then go to a third list and arrow up to the name. Enter 1st list
name - 2nd list name. The calculator will do the subtraction and you will have the differences in
the third list.

NOTE: TI-83+ and TI-84: Use your list of differences as the data. Press STAT and arrow over to
TESTS. Press 2:T-Test. Arrow over to Data and press ENTER. Arrow down and enter 0 for $\mu_0$, the
name of the list where you put the data, and 1 for Freq:. Arrow down to $\mu$: and arrow over to
< $\mu_0$. Press ENTER. Arrow down to Calculate and press ENTER. The p-value is 0.0094 and the test
statistic is -3.04. Do these instructions again except arrow to Draw (instead of Calculate). Press
ENTER.
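The hypnotism example can also be checked away from the calculator. The following is a minimal sketch, assuming Python 3 with only the standard library; the helper name `t_cdf` and its Simpson's-rule approximation of the Student-t area are illustrative choices, not part of the textbook's TI procedure.

```python
import math
import statistics


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) for a Student-t variable using Simpson's rule."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                   # area under the curve from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area


before = [6.6, 6.5, 9.0, 10.3, 11.3, 8.1, 6.3, 11.6]
after = [6.8, 2.4, 7.4, 8.5, 8.1, 6.1, 3.4, 2.0]

diffs = [a - b for a, b in zip(after, before)]   # "after" - "before"
n = len(diffs)
mean_d = statistics.mean(diffs)                  # sample mean of the differences
sd_d = statistics.stdev(diffs)                   # sample standard deviation (n - 1)
t_stat = (mean_d - 0) / (sd_d / math.sqrt(n))    # hypothesized mean difference is 0
p_value = t_cdf(t_stat, n - 1)                   # left-tailed test: P(T <= t)

print(f"mean = {mean_d:.3f}, sd = {sd_d:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

Because the alternative hypothesis is $\mu_d < 0$, the p-value is the area to the left of the test statistic; a right-tailed test would use `1 - t_cdf(...)` instead.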



Example 10.6 

A college football coach was interested in whether the college's strength development class in- 
creased his players' maximum lift (in pounds) on the bench press exercise. He asked 4 of his 
players to participate in a study. The amount of weight they could each lift was recorded before 
they took the strength development class. After completing the class, the amount of weight they 
could each lift was again measured. The data are as follows: 



Weight (in pounds)                            Player 1   Player 2   Player 3   Player 4
Amount of weight lifted prior to the class    205        241        338        368
Amount of weight lifted after the class       295        252        330        360

Table 10.5

The coach wants to know if the strength development class makes his players stronger, on 
average. 

Problem (Solution on p. 275.) 

Record the differences data. Calculate the differences by subtracting the amount of weight lifted 
prior to the class from the weight lifted after completing the class. The data for the differences are: 
{90, 11, -8, -8}. The differences have a normal distribution. 

Using the differences data, calculate the sample mean and the sample standard deviation. 

$\bar{x}_d = 21.3$, $s_d = 46.7$

Using the difference data, this becomes a test of a single (fill in the blank).

Define the random variable: $\bar{X}_d$ = mean difference in the maximum lift per player.

The distribution for the hypothesis test is $t_3$.

$H_0: \mu_d \leq 0$   $H_a: \mu_d > 0$

Graph:

[Figure 10.5: Student-t curve for $\bar{X}_d$ with the p-value = 0.2150 marked.]



Calculate the p-value: The p-value is 0.2150 

Decision: If the level of significance is 5%, the decision is to not reject the null hypothesis because
$\alpha$ < p-value.

What is the conclusion? 

Example 10.7 

Seven eighth graders at Kennedy Middle School measured how far they could push the shot-put 
with their dominant (writing) hand and their weaker (non-writing) hand. They thought that they 
could push equal distances with either hand. The following data were collected.

Distance (in feet) using   Student 1   Student 2   Student 3   Student 4   Student 5   Student 6   Student 7
Dominant Hand              30          26          34          17          19          26          20
Weaker Hand                28          14          27          18          17          26          16

Table 10.6

Problem (Solution on p. 275.) 

Conduct a hypothesis test to determine whether the mean difference in distances between the 
children's dominant versus weaker hands is significant. 

HINT: use a t-test on the difference data. Assume the differences have a normal distribution. The 
random variable is the mean difference. 



CHECK: The test statistic is 2.18 and the p-value is 0.0716. 
What is your conclusion? 
10.6 Summary of Types of Hypothesis Tests 6

Two Population Means 

• Populations are independent and population standard deviations are unknown. 

• Populations are independent and population standard deviations are known (not likely). 

Matched or Paired Samples 

• Two samples are drawn from the same set of objects. 

• Samples are dependent. 

Two Population Proportions 

• Populations are independent. 



6 This content is available online at <http://cnx.org/content/m17044/1.5/>.

10.7 Practice 1: Hypothesis Testing for Two Proportions 7

10.7.1 Student Learning Outcomes 

• The student will conduct a hypothesis test of two proportions. 



10.7.2 Given 

In the recent Census, 3 percent of the U.S. population reported being two or more 
races. However, the percent varies tremendously from state to state. (Source: 

http://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf) Suppose that two random surveys 
are conducted. In the first random survey, out of 1000 North Dakotans, only 9 people reported being of 
two or more races. In the second random survey, out of 500 Nevadans, 17 people reported being of two 
or more races. Conduct a hypothesis test to determine if the population percents are the same for the two 
states or if the percent for Nevada is statistically higher than for North Dakota. 

10.7.3 Hypothesis Testing: Two Proportions 

Exercise 10.7.1 (Solution on p. 275.) 

Is this a test of means or proportions? 

Exercise 10.7.2 (Solution on p. 275.) 

State the null and alternative hypotheses. 

a. H : 

b. H a : 

Exercise 10.7.3 (Solution on p. 275.) 

Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 

Exercise 10.7.4 

What is the Random Variable of interest for this test? 

Exercise 10.7.5 

In words, define the Random Variable for this test. 

Exercise 10.7.6 (Solution on p. 275.) 

Which distribution (Normal or Student's-t) would you use for this hypothesis test?

Exercise 10.7.7 

Explain why you chose the distribution you did for the above question. 

Exercise 10.7.8 (Solution on p. 275.) 

Calculate the test statistic. 

Exercise 10.7.9 

Sketch a graph of the situation. Mark the hypothesized difference and the sample difference.
Shade the area corresponding to the p-value.

[Figure 10.6: blank graph for the sketch; the horizontal axis is labeled $P'_N - P'_{ND}$.]

7 This content is available online at <http://cnx.org/content/m17027/1.13/>.



Exercise 10.7.10 (Solution on p. 275.) 

Find the p-value:

Exercise 10.7.11 (Solution on p. 275.)

At a pre-conceived $\alpha = 0.05$, what is your:

a. Decision: 

b. Reason for the decision: 

c. Conclusion (write out in a complete sentence): 



10.7.4 Discussion Question 

Exercise 10.7.12 

Does it appear that the proportion of Nevadans who are two or more races is higher than the 
proportion of North Dakotans? Why or why not? 
10.8 Practice 2: Hypothesis Testing for Two Averages 8

10.8.1 Student Learning Outcome 

• The student will conduct a hypothesis test of two means. 

10.8.2 Given 

The U.S. Center for Disease Control reports that the mean life expectancy for whites born in 1900 was 
47.6 years and for nonwhites it was 33.0 years, (http://www.cdc.gov/nchs/data/dvs/nvsr53_06tl2.pdf ) 
Suppose that you randomly survey death records for people born in 1900 in a certain county. Of the 124 
whites, the mean life span was 45.3 years with a standard deviation of 12.7 years. Of the 82 nonwhites, the 
mean life span was 34.1 years with a standard deviation of 15.6 years. Conduct a hypothesis test to see if 
the mean life spans in the county were the same for whites and nonwhites. 

10.8.3 Hypothesis Testing: Two Means 

Exercise 10.8.1 (Solution on p. 276.) 

Is this a test of means or proportions? 

Exercise 10.8.2 (Solution on p. 276.) 

State the null and alternative hypotheses. 

a. H : 

b. H a : 

Exercise 10.8.3 (Solution on p. 276.) 

Is this a right-tailed, left-tailed, or two-tailed test? How do you know? 

Exercise 10.8.4 (Solution on p. 276.) 

What is the Random Variable of interest for this test? 

Exercise 10.8.5 (Solution on p. 276.) 

In words, define the Random Variable of interest for this test. 

Exercise 10.8.6 

Which distribution (Normal or Student's-t) would you use for this hypothesis test?

Exercise 10.8.7 

Explain why you chose the distribution you did for the above question. 

Exercise 10.8.8 (Solution on p. 276.) 

Calculate the test statistic. 

Exercise 10.8.9 

Sketch a graph of the situation. Label the horizontal axis. Mark the hypothesized difference and
the sample difference. Shade the area corresponding to the p-value.



8 This content is available online at <http://cnx.org/content/m17039/1.12/>.

Figure 10.7



Exercise 10.8.10 (Solution on p. 276.) 

Find the p-value:

Exercise 10.8.11 (Solution on p. 276.)

At a pre-conceived $\alpha = 0.05$, what is your:

a. Decision: 

b. Reason for the decision: 

c. Conclusion (write out in a complete sentence): 



10.8.4 Discussion Question 

Exercise 10.8.12 

Does it appear that the means are the same? Why or why not? 
10.9 Homework Link 9

Link to homework questions in Homework Collection /Book for Hypothesis Tests (2 means; paired samples; 
2 proportions) Chapter 10 of Collaborative Statistics for R. Bloom 

Chapter 10 Homework Problems ( http://cnx.org/content/m17023/latest/ ) 10

9 This content is available online at <http://cnx.org/content/m19046/1.1/>.
10 http://cnx.org/content/m17023/latest/

10.10 Review Questions Link 11

Link to Review Questions in Homework Collection /Book for Hypothesis Tests with Two Samples Chapter 
10 of Collaborative Statistics for R. Bloom 



Chapter 10 Review Questions ( http://cnx.org/content/m19028/latest/ ) 12

11 This content is available online at <http://cnx.org/content/m19039/1.1/>.
12 http://cnx.org/content/m19028/latest/

Solutions to Exercises in Chapter 10

Solution to Example 10.2, Problem 1 (p. 259) 

two means 

Solution to Example 10.2, Problem 2 (p. 259) 

unknown 

Solution to Example 10.2, Problem 3 (p. 259) 

student's-t 

Solution to Example 10.2, Problem 4 (p. 259) 

$\bar{X}_A - \bar{X}_B$

Solution to Example 10.2, Problem 5 (p. 259)

• $H_0: \mu_A \leq \mu_B$

• $H_a: \mu_A > \mu_B$

Solution to Example 10.2, Problem 6 (p. 259) 

right 

Solution to Example 10.2, Problem 7 (p. 259) 

0.1928 

Solution to Example 10.2, Problem 8 (p. 259) 

Do not reject. 

Solution to Example 10.4, Problem (p. 262) 

The problem asks for a difference in proportions. 
Solution to Example 10.6, Problem (p. 266) 

means; At a 5% level of significance, from the sample data, there is not sufficient evidence to conclude that 
the strength development class helped to make the players stronger, on average. 
Solution to Example 10.7, Problem (p. 267) 

$H_0$: $\mu_d$ equals 0; $H_a$: $\mu_d$ does not equal 0; Do not reject the null; At a 5% significance level, from the
sample data, there is not sufficient evidence to conclude that the mean difference in distances between the 
children's dominant versus weaker hands is significant (there is not sufficient evidence to show that the 
children could push the shot-put further with their dominant hand). Alpha and the p-value are close so the 
test is not strong. 

Solutions to Practice 1: Hypothesis Testing for Two Proportions 

Solution to Exercise 10.7.1 (p. 269) 

Proportions 

Solution to Exercise 10.7.2 (p. 269) 

a. $H_0: p_N = p_{ND}$
b. $H_a: p_N > p_{ND}$

Solution to Exercise 10.7.3 (p. 269) 

right-tailed 

Solution to Exercise 10.7.6 (p. 269) 

Normal 

Solution to Exercise 10.7.8 (p. 269) 

3.50 

Solution to Exercise 10.7.10 (p. 270) 

0.0002 

Solution to Exercise 10.7.11 (p. 270) 

a. Reject the null hypothesis 
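
The Practice 1 answers above (test statistic 3.50, p-value 0.0002) can be reproduced numerically. This is a minimal sketch, assuming Python 3 and the usual pooled-proportion form of the two-proportion z-test; the variable names and the `normal_cdf` helper are illustrative, not taken from the text.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


x_nv, n_nv = 17, 500      # Nevadans reporting two or more races
x_nd, n_nd = 9, 1000      # North Dakotans reporting two or more races

p_nv = x_nv / n_nv
p_nd = x_nd / n_nd
p_pooled = (x_nv + x_nd) / (n_nv + n_nd)         # pooled proportion under H0
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_nv + 1 / n_nd))
z = (p_nv - p_nd) / se                           # test statistic
p_value = 1 - normal_cdf(z)                      # right-tailed test

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```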
Solutions to Practice 2: Hypothesis Testing for Two Averages

Solution to Exercise 10.8.1 (p. 271) 

Means 

Solution to Exercise 10.8.2 (p. 271) 

a. $H_0: \mu_W = \mu_{NW}$

b. $H_a: \mu_W \neq \mu_{NW}$

Solution to Exercise 10.8.3 (p. 271)
two-tailed

Solution to Exercise 10.8.4 (p. 271)
$\bar{X}_W - \bar{X}_{NW}$

Solution to Exercise 10.8.5 (p. 271) 

The difference between the mean life spans of whites and nonwhites. 
Solution to Exercise 10.8.8 (p. 271) 
5.42 

Solution to Exercise 10.8.10 (p. 272) 
0.0000 
Solution to Exercise 10.8.11 (p. 272) 

a. Reject the null hypothesis 
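
The test statistic reported above for Practice 2 (5.42) can be checked from the summary statistics in the problem. This is a minimal sketch, assuming Python 3 and the unpooled (Welch) form of the two-sample t statistic; the variable names are illustrative.

```python
import math

mean_w, sd_w, n_w = 45.3, 12.7, 124      # whites: mean, standard deviation, sample size
mean_nw, sd_nw, n_nw = 34.1, 15.6, 82    # nonwhites: mean, standard deviation, sample size

var_term_w = sd_w ** 2 / n_w
var_term_nw = sd_nw ** 2 / n_nw
se = math.sqrt(var_term_w + var_term_nw)          # standard error of the difference
t_stat = (mean_w - mean_nw) / se                  # hypothesized difference is 0

# Welch-Satterthwaite approximation for the degrees of freedom
df = (var_term_w + var_term_nw) ** 2 / (
    var_term_w ** 2 / (n_w - 1) + var_term_nw ** 2 / (n_nw - 1)
)

print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

With a test statistic this far from 0 and this many degrees of freedom, the two-tailed p-value is extremely small, which is consistent with the 0.0000 shown in the solution.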



Chapter 11 

Linear Regression and Correlation 

11.1 Linear Regression and Correlation 1 

11.1.1 Student Learning Objectives 

By the end of this chapter, the student should be able to: 

• Discuss basic ideas of linear regression and correlation. 

• Create and interpret a line of best fit. 

• Calculate and interpret the correlation coefficient. 

• Calculate and interpret outliers. 



11.1.2 Introduction 

Professionals often want to know how two or more variables are related. For example, is there a relationship 
between the grade on the second math exam a student takes and the grade on the final exam? If there is a 
relationship, what is it and how strong is the relationship? 

In another example, your income may be determined by your education, your profession, your years of 
experience, and your ability. The amount you pay a repair person for labor is often determined by an initial 
amount plus an hourly fee. These are all examples in which regression can be used. 

The type of data described in the examples is bivariate data - "bi" for two variables. In reality, statisticians 
use multivariate data, meaning many variables. 

In this chapter, you will be studying the simplest form of regression, "linear regression" with one indepen- 
dent variable (x). This involves data that fits a line in two dimensions. You will also study correlation which 
measures how strong the relationship is. 

11.2 Linear Equations 2 

Linear regression for two variables is based on a linear equation with one independent variable. It has the 
form: 

y = a + bx (11.1)

where a and b are constant numbers.

1 This content is available online at <http://cnx.org/content/m17089/1.5/>.
2 This content is available online at <http://cnx.org/content/m17086/1.4/>.

x is the independent variable, and y is the dependent variable. Typically, you choose a value to substitute 
for the independent variable and then solve for the dependent variable. 



Example 11.1 

The following examples are linear equations. 



y = 3 + 2x (11.2)

y = -0.01 + 1.2x (11.3)



The graph of a linear equation of the form y = a + bx is a straight line. Any line that is not vertical can be 
described by this equation. 

Example 11.2 




Figure 11.1: Graph of the equation y = -1 + 2x.



Linear equations of this form occur in applications of life sciences, social sciences, psychology, business, 
economics, physical sciences, mathematics, and other areas. 

Example 11.3 

Aaron's Word Processing Service (AWPS) does word processing. Its rate is $32 per hour plus a 
$31.50 one-time charge. The total cost to a customer depends on the number of hours it takes to 
do the word processing job. 

Problem 

Find the equation that expresses the total cost in terms of the number of hours required to finish 
the word processing job. 

Solution 

Let x = the number of hours it takes to get the job done. 

Let y = the total cost to the customer. 

The $31.50 is a fixed cost. If it takes x hours to complete the job, then (32) (x) is the cost of the 
word processing only. The total cost is: 
y = 31.50 + 32x
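
The cost equation from Example 11.3 can be written as a small function. This is a minimal sketch in Python 3; the function name `total_cost` and the sample hour values are illustrative assumptions.

```python
def total_cost(hours):
    """Total AWPS charge in dollars: $31.50 one-time charge plus $32 per hour."""
    fixed_charge = 31.50   # a, the y-intercept
    hourly_rate = 32.0     # b, the slope
    return fixed_charge + hourly_rate * hours


for hours in (0, 1, 2.5):
    print(hours, total_cost(hours))
```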



11.3 Slope and Y-Intercept of a Linear Equation 3 

For the linear equation y = a + bx, b = slope and a = y-intercept. 

From algebra recall that the slope is a number that describes the steepness of a line and the y-intercept is 
the y coordinate of the point (0, a) where the line crosses the y-axis. 





Figure 11.2: Three possible graphs of y = a + bx. (a) If b > 0, the line slopes upward to the right. (b) If
b = 0, the line is horizontal. (c) If b < 0, the line slopes downward to the right.



Example 11.4 

Svetlana tutors to make extra money for college. For each tutoring session, she charges a one-time
fee of $25 plus $15 per hour of tutoring. A linear equation that expresses the total amount of
money Svetlana earns for each session she tutors is y = 25 + 15x. 

Problem 

What are the independent and dependent variables? What is the y-intercept and what is the 
slope? Interpret them using complete sentences. 

Solution 

The independent variable (x) is the number of hours Svetlana tutors each session. The dependent 
variable (y) is the amount, in dollars, Svetlana earns for each session. 

The y-intercept is 25 (a = 25). At the start of the tutoring session, Svetlana charges a one-time fee 
of $25 (this is when x = 0). The slope is 15 (b = 15). For each session, Svetlana earns $15 for each 
hour she tutors. 
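
The interpretation of the slope and y-intercept in Example 11.4 can be checked directly: the value at x = 0 is the intercept, and each additional hour changes y by the slope. This is a minimal sketch in Python 3; the function name `earnings` is a hypothetical label for the equation y = 25 + 15x.

```python
def earnings(hours, intercept=25.0, slope=15.0):
    """Svetlana's earnings in dollars for one session lasting `hours` hours."""
    return intercept + slope * hours


at_start = earnings(0)                    # y-intercept: the one-time fee
per_hour = earnings(3) - earnings(2)      # slope: change in y for one more hour
print(at_start, per_hour)
```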



3 This content is available online at <http://cnx.org/content/m17083/1.5/>.

11.4 Scatter Plots 4

Before we take up the discussion of linear regression and correlation, we need to examine a way to display 
the relation between two variables x and y. The most common and easiest way is a scatter plot. The 
following example illustrates a scatter plot. 

Example 11.5 

From an article in the Wall Street Journal : In Europe and Asia, m-commerce is becoming more 
popular. M-commerce users have special mobile phones that work like electronic wallets as well as 
provide phone and Internet services. Users can do everything from paying for parking to buying 
a TV set or soda from a machine to banking to checking sports scores on the Internet. In the next 
few years, will there be a relationship between the year and the number of m-commerce users? 
Construct a scatter plot. Let x = the year and let y = the number of m-commerce users, in millions. 



x (year)   y (# of users)
2000       0.5
2002       20.0
2003       33.0
2004       47.0

(a)

Figure 11.3: (a) Table showing the number of m-commerce users (in millions) by year. (b) Scatter plot
showing the number of m-commerce users (in millions) by year.



A scatter plot shows the direction and strength of a relationship between the variables. A clear direction 
happens when there is either: 

• High values of one variable occurring with high values of the other variable or low values of one 
variable occurring with low values of the other variable. 

• High values of one variable occurring with low values of the other variable. 

You can determine the strength of the relationship by looking at the scatter plot and seeing how close the 
points are to a line, a power function, an exponential function, or to some other type of function. 

When you look at a scatterplot, you want to notice the overall pattern and any deviations from the pattern. 
The following scatterplot examples illustrate these concepts. 



4 This content is available online at <http://cnx.org/content/m17082/1.6/>.

Figure 11.4: (a) Positive Linear Pattern (Strong) (b) Linear Pattern w/ One Deviation

Figure 11.5: (a) Negative Linear Pattern (Strong) (b) Negative Linear Pattern (Weak)

Figure 11.6: (a) Exponential Growth Pattern (b) No Pattern



In this chapter, we are interested in scatter plots that show a linear pattern. Linear patterns are quite com- 
mon. The linear relationship is strong if the points are close to a straight line. If we think that the points 
show a linear relationship, we would like to draw a line on the scatter plot. This line can be calculated 
through a process called linear regression. However, we only calculate a regression line if one of the vari- 
ables helps to explain or predict the other variable. If x is the independent variable and y the dependent 
variable, then we can use a regression line to predict y for a given value of x. 
11.5 The Regression Equation (modified R. Bloom) 5

11.5.1 Understanding the Regression Equation 

Data rarely fit a straight line exactly. Usually, you must be satisfied with rough predictions. Typically, you 
have a set of data whose scatter plot appears to "fit" a straight line. This is called a Line of Best Fit or Least 
Squares Line. 

Example 11.6 

A random sample of 11 statistics students produced the following data where x is the third exam
score, out of 80, and y is the final exam score, out of 200. Can you predict the final exam score of a
random student if you know the third exam score?



x (third exam score)   y (final exam score)
65                     175
67                     133
71                     185
71                     163
66                     126
75                     198
67                     153
70                     163
71                     159
69                     151
69                     159

(a)

(b) [Scatter plot: Third Exam Score on the horizontal axis and Final Exam Score on the vertical axis.]

Figure 11.7: (a) Table showing the scores on the final exam based on scores from the third exam. (b) Scatter
plot showing the scores on the final exam based on scores from the third exam.



The third exam score, x, is the independent variable and the final exam score, y, is the dependent variable. 



5 This content is available online at <http://cnx.org/content/m33267/1.2/>. 
We will plot a regression line that best "fits" the data. If each of you were to fit a line "by eye", you would
draw different lines. We can use what is called a least-squares regression line to obtain the best fit line.

Consider the following diagram. Each point of data is of the form $(x, y)$ and each point of the line of
best fit using least-squares linear regression has the form $(x, \hat{y})$.

The $\hat{y}$ is read "y hat" and is the estimated value of y. It is the value of y obtained using the regression line.
It is not generally equal to the observed y from data.



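[Figure 11.8: a data point $(x_0, y_0)$, the corresponding point on the line $(x_0, \hat{y}_0)$, and the
vertical distance $|y_0 - \hat{y}_0| = |\epsilon_0|$ between them.]

Figure 11.8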



The term $y - \hat{y}$ is called the residual. It is the observed y value minus the predicted y value. It can also be
called the "error". It is not an error in the sense of a mistake, but measures the vertical distance between the
observed value $y$ and the estimated value $\hat{y}$. In other words, it measures the vertical distance between the
actual data point and the predicted point on the line.

If the observed data point lies above the line, the residual is positive, and the line underestimates the
actual data value for y. If the observed data point lies below the line, the residual is negative, and the line
overestimates the actual data value for y.

In Figure 11.8 above, $y_0 - \hat{y}_0 = \epsilon_0$ is the residual for the point shown. Here the point lies above
the line and the residual is positive.

$\epsilon$ = the Greek letter epsilon

For each data point, you can calculate the residuals or errors, $y_i - \hat{y}_i = \epsilon_i$ for $i = 1, 2, 3, ..., 11$.
Each $|\epsilon_i|$ is a vertical distance.
For the example about the third exam scores and the final exam scores for the 11 statistics students, there
are 11 data points. Therefore, there are 11 $\epsilon$ values. If you square each $\epsilon$ and add, you get

$$(\epsilon_1)^2 + (\epsilon_2)^2 + \dots + (\epsilon_{11})^2 = \sum_{i=1}^{11} \epsilon_i^2$$

This is called the Sum of Squared Errors (SSE).

Using calculus, you can determine the values of a and b that make the SSE a minimum. When you make
the SSE a minimum, you have determined the points that are on the line of best fit. It turns out that the line
of best fit has the equation:

$$\hat{y} = a + bx \qquad (11.4)$$

where $a = \bar{y} - b \cdot \bar{x}$ and $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$.

$\bar{x}$ and $\bar{y}$ are the averages of the x values and the y values, respectively. The best fit line always passes
through the point $(\bar{x}, \bar{y})$.

The slope b can be written as $b = r \left( \frac{s_y}{s_x} \right)$ where $s_y$ = the standard deviation of the y values and $s_x$ = the
standard deviation of the x values. r is the correlation coefficient, which is discussed in the next section.

Least Squares Criteria for Best Fit 

The process of fitting the best fit line is called linear regression. The idea behind finding the best fit line is
based on the assumption that the data are scattered about a straight line. The criterion for the best fit line is
that the sum of the squared errors (SSE) is minimized, that is, made as small as possible. Any other line you
might choose would have a higher SSE than the best fit line. This best fit line is called the least squares
regression line.
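
The least squares criterion can be illustrated with the third exam/final exam data. This is a minimal sketch in Python 3 using the formulas for a and b given above; the function names `least_squares` and `sse`, and the comparison line y = -170 + 4.8x, are illustrative choices only.

```python
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]             # third exam scores
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]  # final exam scores


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared residuals (observed y minus predicted y) for the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(xs, ys))


a, b = least_squares(x, y)
best = sse(x, y, a, b)
other = sse(x, y, -170.0, 4.8)   # an arbitrary nearby line, for comparison only

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {best:.1f}, SSE of other line = {other:.1f}")
```

Any line other than the least squares line, such as the comparison line here, gives a larger SSE.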

NOTE: Computer spreadsheets, statistical software, and many calculators can quickly calculate the
best fit line and create the graphs. The calculations tend to be tedious if done by hand. Instructions
to use the TI-83, TI-83+, and TI-84+ calculators to find the best fit line and create a scatterplot are
shown at the end of this section.

THIRD EXAM vs FINAL EXAM EXAMPLE: 

The graph of the line of best fit for the third exam/final exam example is shown below: 
[Graph: scatter plot of the data with the line of best fit; the horizontal axis is Third Exam Score.]

Figure 11.9



The least squares regression line (best fit line) for the third exam/final exam example has the equation:

$$\hat{y} = -173.51 + 4.83x \qquad (11.5)$$



NOTE: 

Remember, it is always important to plot a scatter diagram first. If the scatter plot indicates that 
there is a linear relationship between the variables, then it is reasonable to use a best fit line 
to make predictions for y given x within the domain of x -values in the sample data, but not 
necessarily for x-values outside that domain. 

You could use the line to predict the final exam score for a student who earned a grade of 73 on 
the third exam. 

You should NOT use the line to predict the final exam score for a student who earned a grade of 
50 on the third exam, because 50 is not within the domain of the x-values in the sample data, 
which are between 65 and 75. 
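
The advice in this note can be built into a prediction helper that refuses to extrapolate. This is a minimal sketch in Python 3; the function name `predict_final` and the choice to raise an error outside the range 65 to 75 are illustrative assumptions, not a rule stated by the text.

```python
def predict_final(third_exam, a=-173.51, b=4.83, x_min=65, x_max=75):
    """Predict the final exam score from the third exam score using y-hat = a + b*x.

    Raises ValueError when the score lies outside the x-values in the sample data.
    """
    if not x_min <= third_exam <= x_max:
        raise ValueError(
            f"{third_exam} is outside the sample domain [{x_min}, {x_max}]"
        )
    return a + b * third_exam


print(round(predict_final(73), 2))   # 73 is within the sample domain

try:
    predict_final(50)                # 50 is outside the sample domain
except ValueError as err:
    print("No prediction:", err)
```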

UNDERSTANDING SLOPE 

The slope of the line, b, describes how changes in the variables are related. It is important to interpret 
the slope of the line in the context of the situation represented by the data. You should be able to write a 
sentence interpreting the slope in plain English. 

INTERPRETATION OF THE SLOPE: The slope of the best fit line tells us how the dependent variable (y) 
changes for every one unit increase in the independent (x) variable, on average. 

THIRD EXAM vs FINAL EXAM EXAMPLE 

Slope: The slope of the line is b = 4.83. 

Interpretation: For a one point increase in the score on the third exam, the final exam score increases by 
4.83 points, on average. 






11.5.2 Using the TI-83+ and TI-84+ Calculators 

Using the Linear Regression T Test: LinRegTTest 

Step 1. In the STAT list editor, enter the X data in list L1 and the Y data in list L2, paired so that the
corresponding (x,y) values are next to each other in the lists. (If a particular pair of values is repeated, enter
it as many times as it appears in the data.)

Step 2. On the STAT TESTS menu, scroll down with the cursor to select the LinRegTTest. (Be careful to select
LinRegTTest as some calculators may also have a different item called LinRegTInt.)

Step 3. On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1

Step 4. On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER

Step 5. Leave the line for "RegEq:" blank

Step 6. Highlight Calculate and press ENTER.



LinRegTTest Input Screen and Output Screen

Input screen:

LinRegTTest
Xlist: L1
Ylist: L2
Freq: 1
β or ρ: ≠0 <0 >0
RegEQ:
Calculate

Output screen (TI-83+ and TI-84+ calculators):

LinRegTTest
y = a + bx
β ≠ 0 and ρ ≠ 0
t = 2.657560155
p = .0261501512
df = 9
a = -173.513363
b = 4.827394209
s = 16.41237711
r² = .4396931104
r = .663093591

Figure 11.10



The output screen contains a lot of information. For now we will focus on a few items from the output, and 
will return later to the other items. 

The second line says y=a+bx. Scroll down to find the values a=-173.513, and b=4.8273; the equation of the
best fit line is $\hat{y} = -173.51 + 4.83x$.

The two items at the bottom are $r^2$ = .43969 and r = .663. For now, just note where to find these values; we
will discuss them in the next two sections.



Graphing the Scatterplot and Regression Line 

Step 1. We are assuming your X data is already entered in list L1 and your Y data is in list L2

Step 2. Press 2nd STATPLOT ENTER to use Plot 1

Step 3. On the input screen for PLOT 1, highlight On and press ENTER

Step 4. For TYPE: highlight the very first icon which is the scatterplot and press ENTER

Step 5. Indicate Xlist: L1 and Ylist: L2

Step 6. For Mark: it does not matter which symbol you highlight.

Step 7. Press the ZOOM key and then the number 9 (for menu item "ZoomStat"); the calculator will fit the
window to the data

Step 8. To graph the best fit line, press the "Y=" key and type the equation -173.5+4.83X into equation Y1.
(The X key is immediately left of the STAT key). Press ZOOM 9 again to graph it.

Step 9. Optional: If you want to change the viewing window, press the WINDOW key. Enter your desired
window using Xmin, Xmax, Ymin, Ymax



11.6 Correlation Coefficient and Coefficient of Determination 6 

11.6.1 The Correlation Coefficient r 

Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a 
good predictor? Use the correlation coefficient as another indicator (besides the scatterplot) of the strength 
of the relationship between x and y. 

The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the
strength of association between the independent variable x and the dependent variable y.

The correlation coefficient is calculated as

$$r = \frac{n \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

where n = the number of data points.

If you suspect a linear relationship between x and y, then r can measure how strong the linear relationship
is.

What the VALUE of r tells us: 

• The value of r is always between -1 and +1: $-1 \leq r \leq 1$.

• The closer the correlation coefficient r is to -1 or 1 (and the further from 0), the stronger the evidence
of a significant linear relationship between x and y; this would indicate that the observed data points
fit more closely to the best fit line. Values of r further from 0 indicate a stronger linear relationship
between x and y. Values of r closer to 0 indicate a weaker linear relationship between x and y.

• If r = 0 there is absolutely no linear relationship between x and y (no linear correlation).

• If r = 1, there is perfect positive correlation. If r = -1, there is perfect negative correlation. In both
these cases, all of the original data points lie on a straight line. Of course, in the real world, this will
not generally happen.

What the SIGN of r tells us 

• A positive value of r means that when x increases, y increases and when x decreases, y decreases 
(positive correlation). 

• A negative value of r means that when x increases, y decreases and when x decreases, y increases 
(negative correlation). 

• The sign of r is the same as the sign of the slope, b, of the best fit line. 

NOTE: Strong correlation does not suggest that x causes y or y causes x. We say "correlation does 
not imply causation." For example, every person who learned math in the 17th century is dead. 
However, learning math does not necessarily cause death! 



6 This content is available online at <http://cnx.org/content/m33269/1.2/>. 
[Figure: three scatter plots. (a) Positive Correlation (b) Negative Correlation (c) Zero Correlation]

Figure 11.11: (a) A scatter plot showing data with a positive correlation: 0 < r < 1. (b) A scatter plot
showing data with a negative correlation: -1 < r < 0. (c) A scatter plot showing data with zero correlation:
r = 0.



The formula for r looks formidable. However, computer spreadsheets, statistical software, and many cal- 
culators can quickly calculate r. The correlation coefficient r is the bottom item in the output screens for the 
LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous section for instructions). 

11.6.2 The Coefficient of Determination 

$r^2$ is called the coefficient of determination. $r^2$ is the square of the correlation coefficient, but is usually
stated as a percent, rather than in decimal form. $r^2$ has an interpretation in the context of the data:

• $r^2$, when expressed as a percent, represents the percent of variation in the dependent variable y that
can be explained by variation in the independent variable x using the regression (best fit) line.

• $1 - r^2$, when expressed as a percent, represents the percent of variation in y that is NOT explained by
variation in x using the regression line. This can be seen as the scattering of the observed data points
about the regression line.

Consider the third exam/final exam example introduced in the previous section.

The line of best fit is: $\hat{y} = -173.51 + 4.83x$

The correlation coefficient is r = 0.6631

The coefficient of determination is $r^2 = 0.6631^2 = 0.4397$

Interpretation of $r^2$ in the context of this example:

Approximately 44% of the variation in the final exam grades can be explained by the variation in the 
grades on the third exam, using the best fit regression line. 

Therefore approximately 56% of the variation in the final exam grades can NOT be explained by the vari- 
ation in the grades on the third exam, using the best fit regression line. (This is seen as the scattering 
of the points about the line.) 
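
The arithmetic behind this interpretation can be checked with a few lines of Python. This is only a sketch; the variable names are ours, and the value of r is the one quoted above for the third exam/final exam example.

```python
r = 0.6631  # sample correlation coefficient from the third exam/final exam example

r_squared = r ** 2
explained_pct = 100 * r_squared          # percent of variation in y explained by the line
unexplained_pct = 100 * (1 - r_squared)  # percent of variation in y NOT explained

print(f"r^2 = {r_squared:.4f}")                       # r^2 = 0.4397
print(f"explained: {explained_pct:.0f}%")             # explained: 44%
print(f"not explained: {unexplained_pct:.0f}%")       # not explained: 56%
```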

In the next section, we will learn more about the correlation coefficient and will examine r in the context of 
the example about grades on the third exam and final exam. 




11.7 Facts About the Correlation Coefficient for Linear Regression (modified R. Bloom) 7

11.7.1 Testing the Significance of the Correlation Coefficient 

The correlation coefficient, r, tells us about the strength of the linear relationship between x and y. However,
the reliability of the linear model also depends on how many observed data points are in the sample. We 
need to look at both the value of the correlation coefficient r and the sample size n, together. 

We perform a hypothesis test of the "significance of the correlation coefficient" to decide whether the 
linear relationship in the sample data is strong enough and reliable enough to use to model the relationship 
in the population. 

The sample data are used to compute r, the correlation coefficient for the sample. If we had data for the
entire population, we could find the population correlation coefficient. But because we only have sample
data, we cannot calculate the population correlation coefficient. The sample correlation coefficient, r, is our
estimate of the unknown population correlation coefficient.

The symbol for the population correlation coefficient is ρ, the Greek letter "rho".

ρ = population correlation coefficient (unknown)

r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is "close to
0" or "significantly different from 0". We decide this based on the sample correlation coefficient r and the
sample size n.

If the test concludes that the correlation coefficient is significantly different from 0, we say that the 
correlation coefficient is "significant". 

• Conclusion: "The correlation coefficient IS SIGNIFICANT" 

• What the conclusion means: We believe that there is a significant linear relationship between x and y.
We can use the regression line to model the linear relationship between x and y in the population.

If the test concludes that the correlation coefficient is not significantly different from 0 (it is close to 0),
we say that the correlation coefficient is "not significant".

• Conclusion: "The correlation coefficient IS NOT SIGNIFICANT."

• What the conclusion means: We do NOT believe that there is a significant linear relationship between
x and y. Therefore we can NOT use the regression line to model a linear relationship between x and y
in the population.

NOTE: 

• If r is significant and the scatter plot shows a reasonable linear trend, the line can be used to 
predict the value of y for values of x that are within the domain of observed x values. 

• If r is not significant OR if the scatter plot does not show a reasonable linear trend, the line 
should not be used for prediction. 

• Even if r is significant and the scatter plot shows a reasonable linear trend, the line may NOT be
appropriate or reliable for prediction OUTSIDE the domain of observed x values in the data.

PERFORMING THE HYPOTHESIS TEST 
SETTING UP THE HYPOTHESES: 



7 This content is available online at <http://cnx.org/content/m33270/1.2/>. 
• Null Hypothesis: H0: ρ = 0

• Alternate Hypothesis: Ha: ρ ≠ 0

What the hypotheses mean in words: 

• Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from 0.
There IS NOT a significant linear relationship (correlation) between x and y in the population.

• Alternate Hypothesis Ha: The population correlation coefficient IS significantly DIFFERENT FROM
0. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

DRAWING A CONCLUSION: 

There are two methods to make the decision. Both methods are equivalent and give the same result. 

Method 1: Using the p-value 

Method 2: Using a table of critical values 

In this chapter of this textbook, we will always use a significance level of 5%, α = 0.05.
Note: Using the p-value method, you could choose any appropriate significance level you want; you are
not limited to using α = 0.05. But the table of critical values provided in this textbook assumes that
we are using a significance level of 5%, α = 0.05. (If we wanted to use a different significance level
than 5% with the critical value method, we would need different tables of critical values that are not
provided in this textbook.)

METHOD 1: Using a p-value to make a decision 

The linear regression t-test LinRegTTest on the TI-83+ or TI-84+ calculators calculates the p-value.
On the LinRegTTest input screen, on the line prompt for β or ρ, highlight "≠ 0".
The output screen shows the p-value on the line that reads "p=".
(Most computer statistical software can calculate the p-value.)

If the p-value is less than the significance level (α = 0.05):

• Decision: REJECT the null hypothesis.

• Conclusion: "The correlation coefficient IS SIGNIFICANT."

• We believe that there IS a significant linear relationship between x and y because the correlation
coefficient is significantly different from 0.

If the p-value is NOT less than the significance level (α = 0.05):

• Decision: DO NOT REJECT the null hypothesis. 

• Conclusion: "The correlation coefficient is NOT significant." 

• We believe that there is NOT a significant linear relationship between x and y because the correlation 
coefficient is NOT significantly different from 0. 

Calculation Notes: 

You will use technology to calculate the p-value. The following describes the calculations to compute the
test statistic and the p-value:

The p-value is calculated using a t-distribution with n - 2 degrees of freedom.

The formula for the test statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$. The value of the test statistic, t, is shown in the computer
or calculator output along with the p-value. The test statistic t has the same sign as the correlation
coefficient r.

The p-value is the probability (area) in both tails further out beyond the values -t and t.

For the TI-83+ and TI-84+ calculators, the command 2*tcdf(abs(t),10^99,n-2) computes the p-value given
by the LinRegTTest; abs(t) denotes the absolute value |t|.
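
For readers without a TI calculator, the same calculation can be sketched in Python using only the standard library. The test statistic is t = r·sqrt(n - 2)/sqrt(1 - r²), and the tail area is approximated by numerically integrating the t-distribution density with Simpson's rule. The integration approach and all function names are choices made for this sketch, not the calculator's internal method; statistical software would normally supply a ready-made t-distribution function.

```python
from math import gamma, pi, sqrt

def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def t_pdf(t, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t, df, steps=10_000):
    """Area in both tails beyond -|t| and |t| (Simpson's rule; steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3      # area between 0 and |t|
    return 2 * (0.5 - area_zero_to_t)   # what remains in the two tails

# Third exam/final exam example: r = 0.6631 with n = 11 data points.
r, n = 0.6631, 11
t = t_statistic(r, n)
p_value = two_tailed_p_value(t, n - 2)
print(round(t, 2), round(p_value, 3))
```

The result should be close to the p-value that LinRegTTest reports for the same r and n.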

THIRD EXAM vs FINAL EXAM EXAMPLE: p-value method

• Consider the third exam/final exam example.

• The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with r = 0.6631 and there are n = 11 data points.

• Can the regression line be used for prediction? Given a third exam score (x value), can we use the
line to predict the final exam score (predicted y value)?

H0: ρ = 0
Ha: ρ ≠ 0
α = 0.05

The p-value is 0.026 (from LinRegTTest on your calculator or from computer software)
The p-value, 0.026, is less than the significance level of α = 0.05
Decision: Reject the Null Hypothesis H0
Conclusion: The correlation coefficient IS SIGNIFICANT.

Because r is significant and the scatter plot shows a reasonable linear trend, the regression line can be 
used to predict final exam scores. 

METHOD 2: Using a table of Critical Values to make a decision 

The 95% Critical Values of the Sample Correlation Coefficient Table (Section 11.11) at the end of this
chapter (after the Summary (Section 11.10)) may be used to give you a good idea of whether the
computed value of r is significant or not. Compare r to the appropriate critical value in the table. If r is not
between the positive and negative critical values, then the correlation coefficient is significant. If r is
significant, then you may want to use the line for prediction.

Example 11.7 

Suppose you computed r = 0.801 using n = 10 data points. df = n - 2 = 10 - 2 = 8. The
critical values associated with df = 8 are -0.632 and +0.632. If r < negative critical value or r >
positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant and
the line may be used for prediction. If you view this example on a number line, it will help you.

[Number line from -1 to +1 marking -0.632, +0.632, and +0.801]

Figure 11.12: r is not significant between -0.632 and +0.632. r = 0.801 > +0.632. Therefore, r is significant.



Example 11.8 

Suppose you computed r = -0.624 with 14 data points. df = 14 - 2 = 12. The critical values are
-0.532 and 0.532. Since -0.624 < -0.532, r is significant and the line may be used for prediction.

[Number line marking -0.624, -0.532, and +0.532]

Figure 11.13: r = -0.624 < -0.532. Therefore, r is significant.
Example 11.9

Suppose you computed r = 0.776 and n = 6. df = 6 - 2 = 4. The critical values are -0.811
and 0.811. Since -0.811 < 0.776 < 0.811, r is not significant and the line should not be used for
prediction.

[Number line marking -0.811, 0.776, and 0.811]

Figure 11.14: -0.811 < r = 0.776 < 0.811. Therefore, r is not significant.



THIRD EXAM vs FINAL EXAM EXAMPLE: critical value method 

• Consider the third exam/final exam example.

• The line of best fit is: $\hat{y} = -173.51 + 4.83x$ with r = 0.6631 and there are n = 11 data points.

• Can the regression line be used for prediction? Given a third exam score (x value), can we use the
line to predict the final exam score (predicted y value)?

H0: ρ = 0
Ha: ρ ≠ 0
α = 0.05

Use the "95% Critical Value" table for r with df = n - 2 = 11 - 2 = 9
The critical values are -0.602 and +0.602
Since 0.6631 > 0.602, r is significant.
Decision: Reject H0

Conclusion: The correlation coefficient is significant

Because r is significant and the scatter plot shows a reasonable linear trend, the regression line can be 
used to predict final exam scores. 

Example 11.10: Additional Practice Examples using Critical Values 

Suppose you computed the following correlation coefficients. Using the table at the end of the 
chapter, determine if r is significant and the line of best fit associated with each r can be used to 
predict a y value. If it helps, draw a number line. 

1. r = -0.567 and the sample size, n, is 19. The df = n - 2 = 17. The critical value is -0.456.
-0.567 < -0.456 so r is significant.

2. r = 0.708 and the sample size, n, is 9. The df = n - 2 = 7. The critical value is 0.666.
0.708 > 0.666 so r is significant.

3. r = 0.134 and the sample size, n, is 14. The df = 14 - 2 = 12. The critical value is 0.532.
0.134 is between -0.532 and 0.532 so r is not significant.

4. r = 0 and the sample size, n, is 5. No matter what the dfs are, r = 0 is between the two
critical values so r is not significant.
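
The critical value comparison used in these examples can be sketched as a small Python lookup. Only the rows needed for the examples in this section are copied from the 95% Critical Values of the Sample Correlation Coefficient table; the dictionary and function names are hypothetical helpers for this sketch.

```python
# Rows copied from the 95% critical value table; keys are df = n - 2.
CRITICAL_VALUES = {3: 0.878, 4: 0.811, 7: 0.666, 8: 0.632,
                   9: 0.602, 12: 0.532, 17: 0.456}

def is_significant(r, n):
    """True when r is NOT between the negative and positive critical values."""
    critical = CRITICAL_VALUES[n - 2]  # KeyError if that df row was not copied above
    return abs(r) > critical

print(is_significant(0.801, 10))    # True   (Example 11.7)
print(is_significant(-0.624, 14))   # True   (Example 11.8)
print(is_significant(0.776, 6))     # False  (Example 11.9)
print(is_significant(0.6631, 11))   # True   (third exam/final exam)
print(is_significant(0.0, 5))       # False  (r = 0 is never significant)
```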
11.7.2 Assumptions in Testing the Significance of the Correlation Coefficient

Testing the significance of the correlation coefficient requires that certain assumptions about the data are 
satisfied. The premise of this test is that the data are a sample of observed points taken from a larger 
population. We have not examined the entire population because it is not possible or feasible to do so. We 
are examining the sample to draw a conclusion about whether the linear relationship that we see between 
x and y in the sample data provides strong enough evidence so that we can conclude that there is a linear 
relationship between x and y in the population. 

The regression line equation that we calculate from the sample data gives the best fit line for our particular 
sample. We want to use this best fit line for the sample as an estimate of the best fit line for the population. 
Examining the scatterplot and testing the significance of the correlation coefficient helps us determine if it 
is appropriate to do this. 

The assumptions underlying the test of significance are: 

• There is a linear relationship in the population that models the average value of y for varying values
of x. In other words, the average of the y values for each particular x value lies on a straight line
in the population. (We do not know the equation for the line for the population. Our regression line
from the sample is our best estimate of this line in the population.)

• The y values for any particular x value are normally distributed about the line. This implies that
there are more y values scattered closer to the line than are scattered farther away. The first assumption
above implies that these normal distributions are centered on the line: the means of these normal
distributions of y values lie on the line.

• The standard deviations of the population y values about the line are equal for each value of x. In
other words, each of these normal distributions of y values has the same shape and spread about the
line.





Figure 11.15: The y values for each x value are normally distributed about the line with the same standard 
deviation. For each x value, the mean of the y values lies on the regression line. More y values lie near the 
line than are scattered further away from the line. 




11.8 Prediction (modified R. Bloom) 8 

Recall the third exam/final exam example.

We examined the scatterplot and showed that the correlation coefficient is significant. We found the equa- 
tion of the best fit line for the final exam grade as a function of the grade on the third exam. We can now 
use the least squares regression line for prediction. 

Suppose you want to estimate, or predict, the final exam score of statistics students who received 73 on the
third exam. The exam scores (x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75,
substitute x = 73 into the equation. Then:

$$\hat{y} = -173.51 + 4.83(73) = 179.08 \qquad (11.7)$$

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on
the final exam, on average.

Remember: Do not use the regression equation for prediction outside the domain of observed x values in 
the data. 
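
A minimal Python sketch of this prediction rule follows. It uses the rounded coefficients quoted in this section and refuses to predict outside the observed x range of 65 to 75. The function name and the choice to raise an error are assumptions of this sketch, not part of the textbook's procedure.

```python
X_MIN, X_MAX = 65, 75  # range of observed third exam scores

def predict_final(third_exam):
    """Predicted final exam score from y-hat = -173.51 + 4.83x."""
    if not X_MIN <= third_exam <= X_MAX:
        raise ValueError("x is outside the domain of observed x values")
    return -173.51 + 4.83 * third_exam

print(round(predict_final(73), 2))  # 179.08
```

Calling `predict_final` with a score below 65 or above 75 raises `ValueError` instead of returning an unreliable extrapolation.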

Example 11.11 

Recall the third exam/final exam example.

Problem 1 

What would you predict the final exam score to be for a student who scored a 66 on the third 
exam? 

Solution 

145.27 



Problem 2 (Solution on p. 306.) 

What would you predict the final exam score to be for a student who scored a 78 on the third 
exam? 



11.9 Outliers (modified R. Bloom) 9 

In some data sets, there are values (observed data points) called outliers. Outliers are observed data 
points that are far from the least squares line. They have large "errors", where the "error" or residual is the 
vertical distance from the line to the point. 

Outliers need to be examined closely. Sometimes, for some reason or another, they should not be included 
in the analysis of the data. It is possible that an outlier is a result of erroneous data. Other times, an outlier 
may hold valuable information about the population under study and should remain included in the data. 
The key is to carefully examine what causes a data point to be an outlier. 

Besides outliers, a sample may contain one or a few points that are called influential points. Influential
points are observed data points that are far from the other observed data points but that greatly influence
the line. As a result, an influential point may be close to the line, even though it is far from the rest of the
data. Because an influential point so strongly influences the best fit line, it generally will not have a large
"error" or residual.

8 This content is available online at <http://cnx.org/content/m33268/1.1/>.
9 This content is available online at <http://cnx.org/content/m33271/1.1/>.

Computers and many calculators can be used to identify outliers from the data. Computer output for 
regression analysis will often identify both outliers and influential points so that you can examine them. 

Identifying Outliers 

We could guess at outliers by looking at a graph of the scatterplot and best fit line. However we would like 
some guideline as to how far away a point needs to be in order to be considered an outlier. As a rough rule 
of thumb, we can flag any point that is located further than two standard deviations above or below the 
best fit line as an outlier. The standard deviation used is the standard deviation of the residuals or errors. 

We can do this visually in the scatterplot by drawing an extra pair of lines that are two standard deviations 
above and below the best fit line. Any data points that are outside this extra pair of lines are flagged as 
potential outliers. Or we can do this numerically by calculating each residual and comparing it to twice the 
standard deviation. On the TI-83, 83+, or 84+, the graphical approach is easier. The graphical procedure 
is shown first, followed by the numerical calculations. You would generally only need to use one of these 
methods. 

Example 11.12 

In the third exam/final exam example, you can determine if there is an outlier or not. If there is 
an outlier, as an exercise, delete it and fit the remaining data to a new line. For this example, the 
new line ought to fit the remaining data better. This means the SSE should be smaller and the 
correlation coefficient ought to be closer to 1 or -1 . 

Solution 

Graphical Identification of Outliers 

With the TI-83,83+,84+ graphing calculators, it is easy to identify the outlier graphically and visu- 
ally. If we were to measure the vertical distance from any data point to the corresponding point 
on the line of best fit and that distance was equal to 2s or farther, then we would consider the data 
point to be "too far" from the line of best fit. We need to find and graph the lines that are two 
standard deviations below and above the regression line. Any points that are outside these two 
lines are outliers. We will call these lines Y2 and Y3: 

As we did with the equation of the regression line and the correlation coefficient, we will use 
technology to calculate this standard deviation for us. Using the LinRegTTest with this data, 
scroll down through the output screens to find s=16.412 

Line Y2=-173.5+4.83X-2(16.4) and line Y3=-173.5+4.83X+2(16.4)

Graph the scatterplot with the best fit line in equation Y1, then enter the two extra lines as Y2 and
Y3 in the "Y=" equation editor and press ZOOM 9. You will find that the only data point that is not
between lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it is just barely outside
these lines. The outlier is the student who had a grade of 65 on the third exam and 175 on the final
exam; this point is further than 2 standard deviations away from the best fit line.

Sometimes a point is so close to the lines used to flag outliers on the graph that it is difficult to 
tell if the point is between or outside the lines. On a computer, enlarging the graph may help; on 
a small calculator screen, zooming in may make the graph more clear. Note that when the graph 
does not give a clear enough picture, you can use the numerical comparisons to identify outliers. 
Figure 11.16

Numerical Identification of Outliers

In the table below, the first two columns are the third exam and final exam data. The third
column shows the predicted $\hat{y}$ values calculated from the line of best fit: $\hat{y} = -173.5 + 4.83x$. The
residuals, or errors, have been calculated in the fourth column of the table: observed y value -
predicted y value = $y - \hat{y}$.

s is the standard deviation of all the $y - \hat{y} = \varepsilon$ values where n = the total number of data points. If
each residual is calculated and squared, and the results are added up, we get the SSE. The standard
deviation of the residuals is calculated from the SSE as:

$$s = \sqrt{\frac{SSE}{n-2}}$$

Rather than calculate the value of s ourselves, we can find s using the computer or calculator. For
this example, our calculator LinRegTTest found s=16.4 as the standard deviation of the residuals:
35; -17; 16; -6; -19; 9; 3; -1; -10; -9; -1.
x     y     $\hat{y}$   $y - \hat{y}$
65    175   140         175 - 140 = 35
67    133   150         133 - 150 = -17
71    185   169         185 - 169 = 16
71    163   169         163 - 169 = -6
66    126   145         126 - 145 = -19
75    198   189         198 - 189 = 9
67    153   150         153 - 150 = 3
70    163   164         163 - 164 = -1
71    159   169         159 - 169 = -10
69    151   160         151 - 160 = -9
69    159   160         159 - 160 = -1



Table 11.1 

We are looking for all data points for which the residual is greater than 2s=2(16.4)=32.8 or less than 
-32.8. Compare these values to the residuals in column 4 of the table. The only such data point is 
the student who had a grade of 65 on the third exam and 175 on the final exam; the residual for 
this student is 35. 
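
The numerical check can also be sketched in Python. The sketch fits the least squares line to the eleven data pairs, computes each residual, and flags any point whose residual is more than 2s from zero. It assumes s = sqrt(SSE/(n - 2)), which is the form that reproduces the calculator's s = 16.412 for these data; the helper name `least_squares` is ours.

```python
from math import sqrt

# Third exam (x) and final exam (y) scores from Table 11.1.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def least_squares(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

a, b = least_squares(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)   # sum of squared errors
s = sqrt(sse / (len(x) - 2))           # standard deviation of the residuals

# Flag points whose residual is greater than 2s or less than -2s.
outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]
print(round(s, 3), outliers)  # 16.412 [(65, 175)]
```

Because the sketch uses unrounded coefficients, its residuals differ slightly from the rounded values in the table, but the same single point is flagged.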

How does the outlier affect the best fit line? 

Numerically and graphically, we have identified the point (65,175) as an outlier. We should re- 
examine the data for this point to see if there are any problems with the data. If there is an error 
we should fix the error if possible, or delete the data. If the data is correct, we would leave it in 
the data set. For this problem, we will suppose that we examined the data and found that this 
outlier data was an error. Therefore we will continue on to delete the outlier, so that we can 
explore how it affects the results, as a learning experience. 

Compute a new best-fit line and correlation coefficient using the 10 remaining points: 

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 and L2. Using the LinRegTTest,
the new line of best fit and the correlation coefficient are:

$\hat{y} = -355.19 + 7.39x$ and r = 0.9121



The new line with r = 0.9121 is a stronger correlation than the original (r=0.6631) because r = 
0.9121 is closer to 1. This means that the new line is a better fit to the 10 remaining data values. 
The line can better predict the final exam score given the third exam score. 
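
The effect of deleting the outlier can be reproduced with a short Python sketch that fits the line twice, once with all 11 points and once without (65, 175). The `fit` helper and the `results` dictionary are illustrative names, not calculator commands.

```python
from math import sqrt

x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

def fit(xs, ys):
    """Return (a, b, r): intercept, slope, and correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

points = list(zip(x, y))
remaining = [p for p in points if p != (65, 175)]  # drop the flagged outlier

results = {}
for label, data in (("all 11 points", points), ("outlier removed", remaining)):
    xs, ys = zip(*data)
    results[label] = fit(xs, ys)
    a, b, r = results[label]
    print(f"{label}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}")
# all 11 points: y-hat = -173.51 + 4.83x, r = 0.6631
# outlier removed: y-hat = -355.19 + 7.39x, r = 0.9121
```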



Example 11.13 

Using this new line of best fit (based on the remaining 10 data points), what would a student 
who receives a 73 on the third exam expect to receive on the final exam? Is this the same as the 
prediction made using the original line? 

NOTE: 

Remember, we do not always delete an outlier. If upon examination, we determined this outlier 
to be a valid data point, we would leave it in the data. 
Example 11.14

(From The Consumer Price Indexes Web site) The Consumer Price Index (CPI) measures the aver- 
age change over time in the prices paid by urban consumers for consumer goods and services. The 
CPI affects nearly all Americans because of the many ways it is used. One of its biggest uses is as 
a measure of inflation. By providing information about price changes in the Nation's economy to 
government, business, and labor, the CPI helps them to make economic decisions. The President, 
Congress, and the Federal Reserve Board use the CPI's trends to formulate monetary and fiscal 
policies. In the following table, x is the year and y is the CPI. 

Data: 



x (year)   y (CPI)
1915       10.1
1926       17.7
1935       13.7
1940       14.7
1947       24.1
1952       26.5
1964       31.0
1969       36.7
1975       49.3
1979       72.6
1980       82.4
1986       109.6
1991       130.7
1999       166.6



Table 11.2 



Problem

Make a scatterplot of the data.

Calculate the least squares line. Write the equation in the form $\hat{y} = a + bx$.

Draw the line on the scatterplot.

Find the correlation coefficient. Is it significant?

What is the average CPI for the year 1990?

Are there any outliers in this data?



Solution 

• Scatter plot and line of best fit.

• $\hat{y} = -3204 + 1.662x$ is the equation of the line of best fit.

• r = 0.8694

• The number of data points is n = 14. Use the 95% Critical Values of the Sample Correlation
Coefficient table at the end of Chapter 12. n - 2 = 12. The corresponding critical value is
0.532. Since 0.8694 > 0.532, r is significant.

• $\hat{y} = -3204 + 1.662(1990) = 103.4$ CPI

• Using the calculator LinRegTTest, we find that s = 25.4; graphing the lines Y2=-3204+1.662X-2(25.4)
and Y3=-3204+1.662X+2(25.4) shows that no data values are outside those lines, identifying
no outliers. (Note that the year 1999 was very close to the upper line, but still inside it.)



[Figure: scatter plot of CPI (y) versus Year (x), 1900 to 2010, with the line of best fit]

Figure 11.17



Note: 

In Example 11.14, notice the pattern of the points compared to the line. Although the correlation
coefficient is significant, the pattern in the scatterplot indicates that a curve would be a more
appropriate model to use than a line. In this example, a statistician should prefer to use other
methods to fit a curve to this data, rather than model the data with the line we found. In addition
to doing the calculations, it is always important to look at the scatterplot when deciding whether
a linear model is appropriate.
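
One rough way to see the curved pattern numerically is to look at the signs of the residuals in year order. The Python sketch below fits the least squares line to the CPI data from Table 11.2 and prints a "+" for each point above the line and a "-" for each point below it. The sign string is only an informal diagnostic chosen for this sketch, not a formal test.

```python
years = [1915, 1926, 1935, 1940, 1947, 1952, 1964,
         1969, 1975, 1979, 1980, 1986, 1991, 1999]
cpi = [10.1, 17.7, 13.7, 14.7, 24.1, 26.5, 31.0,
       36.7, 49.3, 72.6, 82.4, 109.6, 130.7, 166.6]

n = len(years)
x_bar, y_bar = sum(years) / n, sum(cpi) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cpi))
sxx = sum((x - x_bar) ** 2 for x in years)
b = sxy / sxx            # slope of the least squares line
a = y_bar - b * x_bar    # intercept

residuals = [y - (a + b * x) for x, y in zip(years, cpi)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"y-hat = {a:.0f} + {b:.3f}x")  # y-hat = -3204 + 1.662x
print(signs)                          # +++--------+++
```

Points lie above the line at both ends of the time range and below it in the middle, which is the signature of a curved relationship being approximated by a straight line.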

If you are interested in seeing more years of data for Example 11.14, visit the Bureau of Labor Statistics
CPI website ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt ; our data is taken from the
column entitled "Annual Avg." (third column from the right). For example, you could add more
current years of data. Try adding the more recent years 2004: CPI=188.9 and 2008: CPI=215.3 and
see how it affects the model.
11.10 Summary 10

Bivariate Data: Each data point has two values. The form is (x, y).

Line of Best Fit or Least Squares Line (LSL): $\hat{y} = a + bx$

x = independent variable; y = dependent variable

Residual: Actual y value - predicted y value = $y - \hat{y}$

Correlation Coefficient r:

1. Used to determine whether a line of best fit is good for prediction.

2. Between -1 and 1 inclusive. The closer r is to 1 or -1, the closer the original points are to a straight line.

3. If r is negative, the slope is negative. If r is positive, the slope is positive.

4. If r = 0, then the line is horizontal.

Sum of Squared Errors (SSE): The smaller the SSE, the better the original set of points fits the line of best 
fit. 

Outlier: A point that does not seem to fit the rest of the data. 

11.11 95% Critical Values of the Sample Correlation Coefficient Table 11 



Degrees of Freedom: n - 2    Critical Values: (+ and -)
1                            0.997
2                            0.950
3                            0.878
4                            0.811
5                            0.754
6                            0.707
7                            0.666
8                            0.632
9                            0.602
10                           0.576
11                           0.555
12                           0.532
13                           0.514
14                           0.497
15                           0.482
16                           0.468
17                           0.456
18                           0.444
19                           0.433
20                           0.423
21                           0.413
22                           0.404
23                           0.396
24                           0.388
25                           0.381
26                           0.374
27                           0.367
28                           0.361
29                           0.355
30                           0.349
40                           0.304
50                           0.273
60                           0.250
70                           0.232
80                           0.217
90                           0.205
100 and over                 0.195

Table 11.3

10 This content is available online at <http://cnx.org/content/m17081/1.4/>.
11 This content is available online at <http://cnx.org/content/m17098/1.5/>.
11.12 Practice: Linear Regression 12
11.12.1 Student Learning Outcomes 

• The student will explore the properties of linear regression. 



11.12.2 Given 

The data below are real. Keep in mind that these are only reported figures. (Source: Centers for Disease 
Control and Prevention, National Center for HIV, STD, and TB Prevention, October 24, 2003) 

Adults and Adolescents only, United States 



Year        # AIDS cases diagnosed    # AIDS deaths
Pre-1981    91                        29
1981        319                       121
1982        1,170                     453
1983        3,076                     1,482
1984        6,240                     3,466
1985        11,776                    6,878
1986        19,032                    11,987
1987        28,564                    16,162
1988        35,447                    20,868
1989        42,674                    27,591
1990        48,634                    31,335
1991        59,660                    36,560
1992        78,530                    41,055
1993        78,834                    44,730
1994        71,874                    49,095
1995        68,505                    49,456
1996        59,347                    38,510
1997        47,149                    20,736
1998        38,393                    19,005
1999        25,174                    18,454
2000        25,522                    17,347
2001        25,643                    17,402
2002        26,464                    16,371
Total       802,118                   489,093



Table 11.4 



12 This content is available online at <http://cnx.org/content/m17088/1.8/>.

NOTE: We will use the columns "year" and "# AIDS cases diagnosed" for all questions unless
otherwise stated.



11.12.3 Graphing 

Graph "year" vs. "# AIDS cases diagnosed." Plot the points on the graph located below in the section
titled "Plot". Do not include pre-1981. Label both axes with words. Scale both axes.

11.12.4 Data 

Exercise 11.12.1 

Enter your data into your calculator or computer. The pre-1981 data should not be included. Why 
is that so? 



11.12.5 Linear Equation 

Write the linear equation below, rounding to 4 decimal places: 

Exercise 11.12.2 (Solution on p. 306.) 

Calculate the following: 

a. a =

b. b =

c. corr. =

d. n = (# of pairs)

Exercise 11.12.3 (Solution on p. 306.)

equation: ŷ =



11.12.6 Solve 

Exercise 11.12.4 (Solution on p. 306.) 

Solve. 

a. When x = 1985, ŷ =

b. When x = 1990, ŷ =



11.12.7 Plot 

Plot the 2 above points on the graph below. Then, connect the 2 points to form the regression line. 
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Obtain the graph on your calculator or computer. 

11.12.8 Discussion Questions 

Look at the graph above. 

Exercise 11.12.5 

Does the line seem to fit the data? Why or why not? 

Exercise 11.12.6 

Do you think a linear fit is best? Why or why not? 

Exercise 11.12.7 

Hand draw a smooth curve on the graph above that shows the flow of the data. 

Exercise 11.12.8 

What does the correlation imply about the relationship between time (years) and the number of 
diagnosed AIDS cases reported in the U.S.? 

Exercise 11.12.9 

Why is "year" the independent variable and "# AIDS cases diagnosed." the dependent variable 
(instead of the reverse)? 

Exercise 11.12.10 (Solution on p. 306.) 

Solve. 



a. When x = 1970, ŷ =

b. Why doesn't this answer make sense? 
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11.13 Homework Link 13 

Link to homework questions in Homework Collection/Book for Linear Regression, Chapter 12 of
Collaborative Statistics for R. Bloom

Chapter 12 Homework Problems ( http://cnx.org/content/m33266/latest/ ) 14 

11.14 (Untitled) 



13 This content is available online at <http://cnx.org/content/m19048/1.2/>.

14 "Linear Regression and Correlation: Homework" <http://cnx.org/content/m33266/latest/> 
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Solutions to Exercises in Chapter 11 



Solution to Example 11.11, Problem 2 (p. 294)

The x values in the data are between 65 and 75. 78 is outside the domain of the observed x values in
the data (the independent variable), so you cannot reliably predict the final exam score for this student.
(Even though it is possible to enter x into the equation and calculate a y value, you should not do so!)

Solution to Example 11.13, Problem (p. 297)

Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28. A student who scored 73 points on the third
exam would expect to earn 184 points on the final exam.

The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so the prediction using the new line with the
outlier eliminated differs from the original prediction.

Solutions to Practice: Linear Regression 

Solution to Exercise 11.12.2 (p. 303)

a. a = -3,448,225

b. b = 1750

c. corr. = 0.4526

d. n = 22

Solution to Exercise 11.12.3 (p. 303)

ŷ = -3,448,225 + 1750x

Solution to Exercise 11.12.4 (p. 303)

a. 25,082

b. 33,831

Solution to Exercise 11.12.10 (p. 304)

a. -1,164
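
The solutions above can be checked away from the calculator. The following is a minimal Python sketch, not part of the original text; it assumes the Table 11.4 figures for 1981 through 2002 and the usual least-squares formulas, and the helper names least_squares and predict are illustrative only.

```python
import math

# Table 11.4, "# AIDS cases diagnosed", 1981-2002 (pre-1981 excluded).
years = list(range(1981, 2003))
cases = [319, 1170, 3076, 6240, 11776, 19032, 28564, 35447, 42674, 48634,
         59660, 78530, 78834, 71874, 68505, 59347, 47149, 38393, 25174,
         25522, 25643, 26464]


def least_squares(xs, ys):
    """Return (a, b, r, n) for the line y-hat = a + b*x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    r = sxy / math.sqrt(sxx * syy)     # correlation coefficient
    return a, b, r, n


a, b, r, n = least_squares(years, cases)


def predict(x):
    """Estimated number of diagnosed cases for year x."""
    return a + b * x


print(f"a = {a:.0f}, b = {b:.0f}, corr. = {r:.4f}, n = {n}")
for year in (1985, 1990, 1970):
    print(year, round(predict(year)))
```

The prediction for 1970 is negative because 1970 lies outside the range of observed years, which is the point of Exercise 11.12.10.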



Appendix 



12.1 Data Sets 1 

12.1.1 Lap Times 

The following tables provide lap times from Terri Vogel's Log Book. Times are recorded in seconds for 
2.5-mile laps completed in a series of races and practice runs. 

Race Lap Times (in Seconds)

           Lap 1   Lap 2   Lap 3   Lap 4   Lap 5   Lap 6   Lap 7
Race 1       135     130     131     132     130     131     133
Race 2       134     131     131     129     128     128     129
Race 3       129     128     127     127     130     127     129
Race 4       125     125     126     125     124     125     125
Race 5       133     132     132     132     131     130     132
Race 6       130     130     130     129     129     130     129
Race 7       132     131     133     131     134     134     131
Race 8       127     128     127     130     128     126     128
Race 9       132     130     127     128     126     127     124
Race 10      135     131     131     132     130     131     130
Race 11      132     131     132     131     130     129     129
Race 12      134     130     130     130     131     130     130
Race 13      128     127     128     128     128     129     128
Race 14      132     131     131     131     132     130     130
Race 15      136     129     129     129     129     129     129
Race 16      129     129     129     128     128     129     129
Race 17      134     131     132     131     132     132     132
Race 18      129     129     130     130     133     133     127
Race 19      130     129     129     129     129     129     128
Race 20      131     128     130     128     129     130     130



1 This content is available online at <http://cnx.org/content/m17132/1.5/>.



Table 12.1 
Practice Lap Times (in Seconds)

               Lap 1   Lap 2   Lap 3   Lap 4   Lap 5   Lap 6   Lap 7
Practice 1       142     143     180     137     134     134     172
Practice 2       140     135     134     133     128     128     131
Practice 3       130     133     130     128     135     133     133
Practice 4       141     136     137     136     136     136     145
Practice 5       140     138     136     137     135     134     134
Practice 6       142     142     139     138     129     129     127
Practice 7       139     137     135     135     137     134     135
Practice 8       143     136     134     133     134     133     132
Practice 9       135     134     133     133     132     132     133
Practice 10      131     130     128     129     127     128     127
Practice 11      143     139     139     138     138     137     138
Practice 12      132     133     131     129     128     127     126
Practice 13      149     144     144     139     138     138     137
Practice 14      133     132     137     133     134     130     131
Practice 15      138     136     133     133     132     131     131



Table 12.2 



12.1.2 Stock Prices 

The following table lists initial public offering (IPO) stock prices for all 1999 stocks that at least doubled in 
value during the first day of trading. This is historical data. 
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IPO Offer Prices 



$17.00  $23.00  $14.00  $16.00  $12.00  $26.00  $20.00  $22.00  $14.00
$15.00  $22.00  $18.00  $18.00  $21.00  $21.00  $19.00  $15.00  $21.00
$18.00  $17.00  $15.00  $25.00  $14.00  $30.00  $16.00  $10.00  $20.00
$12.00  $16.00  $17.44  $16.00  $14.00  $15.00  $20.00  $20.00  $16.00
$17.00  $16.00  $15.00  $15.00  $19.00  $48.00  $16.00  $18.00   $9.00
$18.00  $18.00  $20.00   $8.00  $20.00  $17.00  $14.00  $11.00  $16.00
$19.00  $15.00  $21.00  $12.00   $8.00  $16.00  $13.00  $14.00  $15.00
$14.00  $13.41  $28.00  $21.00  $17.00  $28.00  $17.00  $19.00  $16.00
$17.00  $19.00  $18.00  $17.00  $15.00  $14.00  $21.00  $12.00  $18.00
$24.00  $15.00  $23.00  $14.00  $16.00  $12.00  $24.00  $20.00  $14.00
$14.00  $15.00  $14.00  $19.00  $16.00  $38.00  $20.00  $24.00  $16.00
 $8.00  $18.00  $17.00  $16.00  $15.00   $7.00  $19.00  $12.00   $8.00
$23.00  $12.00  $18.00  $20.00  $21.00  $34.00  $16.00  $26.00  $14.00





Table 12.3 



NOTE: Data compiled by Jay R. Ritter of Univ. of Florida using data from Securities Data Co. and 
Bloomberg. 
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12.2 Symbols and their Meanings 2 

Symbols and their Meanings 



Chapter (1st used)                  Symbol           Spoken                          Meaning

Sampling and Data                   √                The square root of              same
Sampling and Data                   π                Pi                              3.14159... (a specific number)
Descriptive Statistics              Q1               Quartile one                    the first quartile
Descriptive Statistics              Q2               Quartile two                    the second quartile
Descriptive Statistics              Q3               Quartile three                  the third quartile
Descriptive Statistics              IQR              inter-quartile range            Q3 - Q1 = IQR
Descriptive Statistics              x̄                x-bar                           sample mean
Descriptive Statistics              μ                mu                              population mean
Descriptive Statistics              s, s_x, sx       s                               sample standard deviation
Descriptive Statistics              s², s_x²         s-squared                       sample variance
Descriptive Statistics              σ, σ_x, σx       sigma                           population standard deviation
Descriptive Statistics              σ², σ_x²         sigma-squared                   population variance
Descriptive Statistics              Σ                capital sigma                   sum
Probability Topics                  {}               brackets                        set notation
Probability Topics                  S                S                               sample space
Probability Topics                  A                Event A                         event A
Probability Topics                  P(A)             probability of A                probability of A occurring
Probability Topics                  P(A|B)           probability of A given B        prob. of A occurring given B has occurred
Probability Topics                  P(A or B)        prob. of A or B                 prob. of A or B or both occurring
Probability Topics                  P(A and B)       prob. of A and B                prob. of both A and B occurring (same time)
Probability Topics                  A'               A-prime, complement of A        complement of A, not A
Probability Topics                  P(A')            prob. of complement of A        same
Probability Topics                  G1               green on first pick             same
Probability Topics                  P(G1)            prob. of green on first pick    same
Discrete Random Variables           PDF              prob. distribution function     same
Discrete Random Variables           X                X                               the random variable X
Discrete Random Variables           X ~              the distribution of X           same
Discrete Random Variables           B                binomial distribution           same
Discrete Random Variables           G                geometric distribution          same
Discrete Random Variables           H                hypergeometric dist.            same
Discrete Random Variables           P                Poisson dist.                   same
Discrete Random Variables           λ                Lambda                          average of Poisson distribution
Discrete Random Variables           ≥                greater than or equal to        same
Discrete Random Variables           ≤                less than or equal to           same
Discrete Random Variables           =                equal to                        same
Discrete Random Variables           ≠                not equal to                    same
Continuous Random Variables         f(x)             f of x                          function of x
Continuous Random Variables         pdf              prob. density function          same
Continuous Random Variables         U                uniform distribution            same
Continuous Random Variables         Exp              exponential distribution        same
Continuous Random Variables         k                k                               critical value
Continuous Random Variables         f(x) =           f of x equals                   same
Continuous Random Variables         m                m                               decay rate (for exp. dist.)
The Normal Distribution             N                normal distribution             same
The Normal Distribution             z                z-score                         same
The Normal Distribution             Z                standard normal dist.           same
The Central Limit Theorem           CLT              Central Limit Theorem           same
The Central Limit Theorem           X̄                X-bar                           the random variable X-bar
The Central Limit Theorem           μ_X              mean of X                       the average of X
The Central Limit Theorem           μ_X̄              mean of X-bar                   the average of X-bar
The Central Limit Theorem           σ_X              standard deviation of X         same
The Central Limit Theorem           σ_X̄              standard deviation of X-bar     same
The Central Limit Theorem           ΣX               sum of X                        same
The Central Limit Theorem           Σx               sum of x                        same
Confidence Intervals                CL               confidence level                same
Confidence Intervals                CI               confidence interval             same
Confidence Intervals                EBM              error bound for a mean          same
Confidence Intervals                EBP              error bound for a proportion    same
Confidence Intervals                t                student-t distribution          same
Confidence Intervals                df               degrees of freedom              same
Confidence Intervals                t_(α/2)          student-t with α/2 area in right tail    same
Confidence Intervals                p', p̂            p-prime; p-hat                  sample proportion of success
Confidence Intervals                q', q̂            q-prime; q-hat                  sample proportion of failure
Hypothesis Testing                  H0               H-naught, H-sub 0               null hypothesis
Hypothesis Testing                  Ha               H-a, H-sub a                    alternate hypothesis
Hypothesis Testing                  H1               H-1, H-sub 1                    alternate hypothesis
Hypothesis Testing                  α                alpha                           probability of Type I error
Hypothesis Testing                  β                beta                            probability of Type II error
Hypothesis Testing                  X̄1 - X̄2          X1-bar minus X2-bar             difference in sample means
                                    μ1 - μ2          mu-1 minus mu-2                 difference in population means
                                    P'1 - P'2        P1-prime minus P2-prime         difference in sample proportions
                                    p1 - p2          p1 minus p2                     difference in population proportions
Chi-Square Distribution             χ²               Ky-square                       Chi-square
                                    O                Observed                        Observed frequency
                                    E                Expected                        Expected frequency
Linear Regression and Correlation   y = a + bx       y equals a plus b-x             equation of a line
                                    ŷ                y-hat                           estimated value of y
                                    r                correlation coefficient         same
                                    ε                error                           same
                                    SSE              Sum of Squared Errors           same
                                    1.9s             1.9 times s                     cut-off value for outliers
F-Distribution and ANOVA            F                F-ratio                         F ratio

Table 12.4

2 This content is available online at <http://cnx.org/content/m16302/1.9/>.
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12.3 English Phrases Written Mathematically 3 
12.3.1 English Phrases Written Mathematically 



When the English says:               Interpret this as:

X is at least 4.                     X ≥ 4
The minimum of X is 4.               X ≥ 4
X is no less than 4.                 X ≥ 4
X is greater than or equal to 4.     X ≥ 4

X is at most 4.                      X ≤ 4
The maximum of X is 4.               X ≤ 4
X is no more than 4.                 X ≤ 4
X is less than or equal to 4.        X ≤ 4
X does not exceed 4.                 X ≤ 4

X is greater than 4.                 X > 4
There are more X than 4.             X > 4
X exceeds 4.                         X > 4

X is less than 4.                    X < 4
There are fewer X than 4.            X < 4

X is 4.                              X = 4
X is equal to 4.                     X = 4
X is the same as 4.                  X = 4

X is not 4.                          X ≠ 4
X is not equal to 4.                 X ≠ 4
X is not the same as 4.              X ≠ 4
X is different than 4.               X ≠ 4







Table 12.5 



3 This content is available online at <http://cnx.org/content/m16307/1.5/>.
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12.4 Formulas 4 

Formula 12.1: Factorial

$n! = n(n-1)(n-2)\cdots(1)$

$0! = 1$

Formula 12.2: Combinations

$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$

Formula 12.3: Binomial Distribution

$X \sim B(n, p)$

$P(X = x) = \binom{n}{x} p^{x} q^{n-x}$, for $x = 0, 1, 2, \ldots, n$

Formula 12.4: Geometric Distribution

$X \sim G(p)$

$P(X = x) = q^{x-1} p$, for $x = 1, 2, 3, \ldots$

Formula 12.5: Hypergeometric Distribution

$X \sim H(r, b, n)$

Formula 12.6: Poisson Distribution

$X \sim P(\lambda)$

$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$

Formula 12.7: Uniform Distribution

$X \sim U(a, b)$

$f(x) = \frac{1}{b-a}$, $a < x < b$

Formula 12.8: Exponential Distribution

$X \sim Exp(m)$

$f(x) = m e^{-mx}$, $m > 0$, $x \geq 0$

Formula 12.9: Normal Distribution

$X \sim N(\mu, \sigma^{2})$

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}$, $-\infty < x < \infty$

Formula 12.10: Gamma Function

$\Gamma(z) = \int_{0}^{\infty} x^{z-1} e^{-x}\,dx$, $z > 0$

$\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$

$\Gamma(m+1) = m!$ for $m$, a nonnegative integer

otherwise: $\Gamma(a+1) = a\,\Gamma(a)$
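Formula 12.11: Student-t Distribution

$f(x) = \frac{\left(1 + \frac{x^{2}}{n}\right)^{-\frac{n+1}{2}} \Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}$

$X = \frac{Z}{\sqrt{Y/n}}$

$Z \sim N(0, 1)$, $Y \sim \chi^{2}_{df}$, $n$ = degrees of freedom

Formula 12.12: Chi-Square Distribution

$X \sim \chi^{2}_{df}$

$f(x) = \frac{x^{\frac{n-2}{2}} e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\,\Gamma\left(\frac{n}{2}\right)}$, $x > 0$, $n$ = positive integer and degrees of freedom

Formula 12.13: F Distribution

$X \sim F_{df(n),\,df(d)}$

df(n) = degrees of freedom for the numerator
df(d) = degrees of freedom for the denominator

$f(x) = \frac{\Gamma\left(\frac{u+v}{2}\right)}{\Gamma\left(\frac{u}{2}\right)\Gamma\left(\frac{v}{2}\right)} \left(\frac{u}{v}\right)^{\frac{u}{2}} x^{\left(\frac{u}{2}-1\right)} \left[1 + \left(\frac{u}{v}\right)x\right]^{-0.5(u+v)}$, with $u = df(n)$ and $v = df(d)$

$X = \frac{Y/u}{W/v}$, $Y$, $W$ are chi-square

4 This content is available online at <http://cnx.org/content/m16301/1.7/>.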
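
Several of the formulas above translate directly into code. The following is a minimal Python sketch, not taken from the textbook; the function names are illustrative assumptions, and only the Python standard library is used.

```python
import math


def combinations(n, r):
    """Formula 12.2: n! / ((n - r)! r!)."""
    return math.factorial(n) // (math.factorial(n - r) * math.factorial(r))


def binomial_pmf(n, p, x):
    """Formula 12.3: P(X = x) for X ~ B(n, p), with q = 1 - p."""
    q = 1 - p
    return combinations(n, x) * p ** x * q ** (n - x)


def geometric_pmf(p, x):
    """Formula 12.4: P(X = x) = q^(x-1) p for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p


def uniform_pdf(a, b, x):
    """Formula 12.7: f(x) = 1 / (b - a) on a < x < b, and 0 elsewhere."""
    return 1 / (b - a) if a < x < b else 0.0


def exponential_pdf(m, x):
    """Formula 12.8: f(x) = m e^(-m x) for x >= 0, and 0 elsewhere."""
    return m * math.exp(-m * x) if x >= 0 else 0.0


def normal_pdf(mu, sigma, x):
    """Formula 12.9: density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Illustrative inputs only.
print(combinations(5, 2))
print(binomial_pmf(4, 0.5, 2))
print(geometric_pmf(0.5, 3))
print(uniform_pdf(0, 4, 1))
print(exponential_pdf(0.5, 2))
print(normal_pdf(0, 1, 0))
# Formula 12.10: the square of gamma(1/2) should agree with pi.
print(math.gamma(0.5) ** 2, math.pi)
```

The gamma function itself is available as math.gamma, so the Gamma Function identities can be checked numerically rather than re-implemented.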
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12.5 Notes for the TI-83, 83+, 84 Calculator 5 

12.5.1 Quick Tips 
Legend

• A key name by itself (for example, ENTER) represents a button press

• [ ] represents yellow command or green letter behind a key

• < > represents items on the screen

To adjust the contrast

Press 2nd, then hold the up arrow to increase the contrast or the down arrow to decrease the contrast.

To capitalize letters and words

Press ALPHA to get one capital letter, or press 2nd, then ALPHA to set all button presses to capital
letters. You can return to the top-level button values by pressing ALPHA again.

To correct a mistake

If you hit a wrong button, just hit CLEAR and start again.

To write in scientific notation

Numbers in scientific notation are expressed on the TI-83, 83+, and 84 using E notation, such that...

• 4.321 E 4 = 4.321 x 10^4

• 4.321 E -4 = 4.321 x 10^-4
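
E notation is not specific to the calculator. As a small illustration (an addition to the original text, using Python rather than the TI-83), the same two numbers can be written and checked like this:

```python
import math

# E notation: the number after E is the power of ten.
large = 4.321E4     # 4.321 x 10^4
small = 4.321E-4    # 4.321 x 10^-4

print(large)                                   # the value written out in full
print(f"{large:.3E}")                          # formatted back into E notation
print(math.isclose(small, 4.321 * 10 ** -4))   # both spellings agree
```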

To transfer programs or equations from one calculator to another:

Both calculators: Insert your respective end of the link cable and press 2nd, then [LINK].

Calculator receiving information:

Step 1. Use the arrows to navigate to and select <RECEIVE>.
Step 2. Press ENTER.

Calculator sending information:

Step 1. Press the appropriate number or letter.

Step 2. Use the up and down arrows to access the appropriate item.

Step 3. Press ENTER to select the item to transfer.

Step 4. Press the right arrow to navigate to and select <TRANSMIT>.

Step 5. Press ENTER.

NOTE: ERROR 35 LINK generally means that the cables have not been inserted far enough.

Both calculators: Press 2nd, then [QUIT] to exit when done.



5 This content is available online at <http://cnx.org/content/m19710/1.6/>.
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12.5.2 Manipulating One- Variable Statistics 

NOTE: These directions are for entering data with the built-in statistical program.

Sample Data

Data    Frequency
-2      10
-1      3
0       4
1       5
3       8

Table 12.6: We are manipulating 1-variable statistics.



To begin:

Step 1. Turn on the calculator.

Step 2. Access statistics mode.

Step 3. Select <4:ClrList> to clear data from lists, if desired.

4, ENTER

Step 4. Enter list [L1] to be cleared.

2nd, [L1], ENTER

Step 5. Display last instruction.

2nd, [ENTRY]

Step 6. Continue clearing remaining lists in the same fashion, if desired.

2nd, [L2], ENTER

Step 7. Access statistics mode.

Step 8. Select <1:Edit...>

Step 9. Enter data. Data values go into [L1]. (You may need to arrow over to [L1].)

• Type in a data value and enter it. (For negative numbers, use the negate (-) key at the bottom of
the keypad.)

• Continue in the same manner until all data values are entered.

Step 10. In [L2], enter the frequencies for each data value in [L1].

• Type in a frequency and enter it. (If a data value appears only once, the frequency is "1".)

4, ENTER

• Continue in the same manner until all data values are entered.

Step 11. Access statistics mode.

Step 12. Navigate to <CALC>.

Step 13. Access <1:1-Var Stats>.

Step 14. Indicate that the data is in [L1]...

2nd, [L1], comma (,)

Step 15. ...and indicate that the frequencies are in [L2].

2nd, [L2], ENTER

Step 16. The statistics should be displayed. You may arrow down to get remaining statistics. Repeat as
necessary.
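
The same one-variable statistics can be computed off the calculator. The sketch below is an illustrative Python addition, not part of the original directions; it assumes the Table 12.6 sample data, including the assumption that the data value missing from the extracted table is 0.

```python
import statistics

# Sample data from Table 12.6 as (value, frequency) pairs.
# Assumption: the value missing from the extracted table is 0.
data_freq = [(-2, 10), (-1, 3), (0, 4), (1, 5), (3, 8)]

# Expand the frequency table into a flat list, much as 1-Var Stats
# combines the values in L1 with the frequencies in L2.
values = [x for x, f in data_freq for _ in range(f)]

n = len(values)
mean = statistics.mean(values)
sample_sd = statistics.stdev(values)        # divides by n - 1
population_sd = statistics.pstdev(values)   # divides by n
median = statistics.median(values)

print(f"n = {n}, mean = {mean:.3f}, s = {sample_sd:.3f}, "
      f"sigma = {population_sd:.3f}, median = {median}")
```

The calculator reports both standard deviations as well; the sample version is always the larger of the two for the same data.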

12.5.3 Drawing Histograms 

NOTE: We will assume that the data is already entered.

We will construct 2 histograms with the built-in STATPLOT application. The first way will use the default
ZOOM. The second way will involve customizing a new graph.

Step 1. Access graphing mode.

2nd, [STAT PLOT]

Step 2. Select <1:Plot 1> to access plotting - first graph.

Step 3. Use the arrows to navigate to <ON> to turn on Plot 1.

<ON>, ENTER

Step 4. Use the arrows to go to the histogram picture and select the histogram.

ENTER

Step 5. Use the arrows to navigate to <Xlist>.

Step 6. If "L1" is not selected, select it.

2nd, [L1], ENTER

Step 7. Use the arrows to navigate to <Freq>.

Step 8. Assign the frequencies to [L2].

2nd, [L2], ENTER

Step 9. Go back to access other graphs.

2nd, [STAT PLOT]

Step 10. Use the arrows to turn off the remaining plots.

Step 11. Be sure to deselect or clear all equations before graphing.

To deselect equations:

Step 1. Access the list of equations.

Step 2. Select each equal sign (=).

Step 3. Continue, until all equations are deselected.

To clear equations:

Step 1. Access the list of equations.

Step 2. Use the arrow keys to navigate to the right of each equal sign (=) and clear them.

Step 3. Repeat until all equations are deleted.

To draw default histogram:

Step 1. Access the ZOOM menu.

Step 2. Select <9:ZoomStat>.

Step 3. The histogram will show with a window automatically set.
To draw custom histogram:

Step 1. Access WINDOW to set the graph parameters.

Step 2. • Xmin = -2.5

• Xmax = 3.5

• Xscl = 1 (width of bars)

• Ymin = 0

• Ymax = 10

• Yscl = 1 (spacing of tick marks on y-axis)

• Xres = 1

Step 3. Access GRAPH to see the histogram.

To draw box plots:

Step 1. Access graphing mode.

2nd, [STAT PLOT]

Step 2. Select <1:Plot 1> to access the first graph.

Step 3. Use the arrows to select <ON> and turn on Plot 1.

Step 4. Use the arrows to select the box plot picture and enable it.

Step 5. Use the arrows to navigate to <Xlist>.

Step 6. If "L1" is not selected, select it.

2nd, [L1], ENTER

Step 7. Use the arrows to navigate to <Freq>.

Step 8. Indicate that the frequencies are in [L2].

2nd, [L2], ENTER

Step 9. Go back to access other graphs.

2nd, [STAT PLOT]

Step 10. Be sure to deselect or clear all equations before graphing using the method mentioned above.

Step 11. View the box plot.

GRAPH
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12.5.4 Linear Regression 

12.5.4.1 Sample Data 

The following data is real. The percent of declared ethnic minority students at De Anza College for selected
years from 1970 - 1995 was:

Year    Student Ethnic Minority Percentage
1970    14.13
1973    12.27
1976    14.08
1979    18.16
1982    27.64
1983    28.72
1986    31.86
1989    33.14
1992    45.37
1995    53.1

Table 12.7: The independent variable is "Year," while the dependent variable is "Student Ethnic Minority
Percent."



[Scatterplot: Student Ethnic Minority Percentage (vertical axis) versus Year (horizontal axis)]

Figure 12.1: By hand, verify the scatterplot above.
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NOTE: The TI-83 has a built-in linear regression feature, which allows the data to be edited. The 
x-values will be in [L1]; the y-values in [L2].

To enter data and do linear regression:

Step 1. ON turns the calculator on.

Step 2. Before accessing this program, be sure to turn off all plots.

• Access graphing mode.

2nd, [STAT PLOT]

• Turn off all plots.

4, ENTER

Step 3. Round to 3 decimal places. To do so:

• Access the mode menu.

MODE

• Navigate to <Float> and then to the right to <3>.

• All numbers will be rounded to 3 decimal places until changed.

Step 4. Enter statistics mode and clear lists [L1] and [L2], as described above.

STAT, 4

Step 5. Enter editing mode to insert values for x and y.

STAT, ENTER

Step 6. Enter each value. Press ENTER to continue.

To display the correlation coefficient:

Step 1. Access the catalog.

2nd, [CATALOG]

Step 2. Arrow down and select <DiagnosticOn>.

Step 3. r and r² will be displayed during regression calculations.

Step 4. Access linear regression.

Step 5. Select the form of y = a + bx.

8, ENTER

The display will show:

LinReg

• y = a + bx

• a = -3176.909

• b = 1.617

• r² = 0.924

• r = 0.961

This means the Line of Best Fit (Least Squares Line) is:

• y = -3176.909 + 1.617x

• Percent = -3176.909 + 1.617(year #)

The correlation coefficient r = 0.961
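
The LinReg output above can be reproduced without the calculator. This is a minimal Python sketch added for illustration; it assumes the Table 12.7 data and the standard least-squares formulas, and the variable names are not from the original text.

```python
import math

# Table 12.7: year (x, as in L1) and ethnic minority percentage (y, as in L2).
years = [1970, 1973, 1976, 1979, 1982, 1983, 1986, 1989, 1992, 1995]
percent = [14.13, 12.27, 14.08, 18.16, 27.64, 28.72, 31.86, 33.14, 45.37, 53.1]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(percent) / n
sxx = sum((x - x_bar) ** 2 for x in years)
syy = sum((y - y_bar) ** 2 for y in percent)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, percent))

b = sxy / sxx                    # slope
a = y_bar - b * x_bar            # intercept
r = sxy / math.sqrt(sxx * syy)   # correlation coefficient

print(f"a = {a:.3f}")
print(f"b = {b:.3f}")
print(f"r^2 = {r * r:.3f}")
print(f"r = {r:.3f}")
```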
To see the scatter plot:

Step 1. Access graphing mode.

2nd, [STAT PLOT]

Step 2. Select <1:Plot 1> to access plotting - first graph.

Step 3. Navigate and select <ON> to turn on Plot 1.

<ON>, ENTER

Step 4. Navigate to the first picture.

Step 5. Select the scatter plot.

Step 6. Navigate to <Xlist>.

Step 7. If [L1] is not selected, press 2nd, [L1] to select it.

Step 8. Confirm that the data values are in [L1].

ENTER

Step 9. Navigate to <Ylist>.

Step 10. Indicate that the y-values are in [L2].

2nd, [L2], ENTER

Step 11. Go back to access other graphs.

2nd, [STAT PLOT]

Step 12. Use the arrows to turn off the remaining plots.

Step 13. Access WINDOW to set the graph parameters.

• Xmin = 1970

• Xmax = 2000

• Xscl = 10 (spacing of tick marks on x-axis)

• Ymin = -0.05

• Ymax = 60

• Yscl = 10 (spacing of tick marks on y-axis)

• Xres = 1

Step 14. Be sure to deselect or clear all equations before graphing, using the instructions above.

Step 15. Press GRAPH to see the scatter plot.

To see the regression graph:

Step 1. Access the equation menu. The regression equation will be put into Y1.

Step 2. Access the vars menu and navigate to <5:Statistics>.

VARS, 5

Step 3. Navigate to <EQ>.

Step 4. <1:RegEQ> contains the regression equation which will be entered in Y1.

Step 5. Press GRAPH. The regression line will be superimposed over the scatter plot.

To see the residuals and use them to calculate the critical point for an outlier:

Step 1. Access the list. RESID will be an item on the menu. Navigate to it.

2nd, [LIST], <RESID>

Step 2. Confirm twice to view the list of residuals. Use the arrows to select them.

ENTER, ENTER

Step 3. The critical point for an outlier is $1.9\sqrt{\frac{SSE}{n-2}}$ where:

• n = number of pairs of data

• SSE = sum of the squared errors

• $\sum \text{residual}^{2}$

Step 4. Store the residuals in [L3].

STO, 2nd, [L3], ENTER

Step 5. Calculate $\frac{(\text{residual})^{2}}{n-2}$. Note that n - 2 = 8.

2nd, [L3], x², ÷, 8

Step 6. Store this value in [L4].

STO, 2nd, [L4], ENTER

Step 7. Calculate the critical value using the equation above, that is, 1.9 times the square root of the
sum of the values in [L4].

Step 8. Verify that the calculator displays: 7.642669563. This is the critical value.

Step 9. Compare the absolute value of each residual value in [L3] to 7.64. If the absolute value is greater
than 7.64, then the corresponding (x, y) point is an outlier. In this case, none of the points is an outlier.
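
The residual steps above can be sketched in a few lines of Python. This is an illustrative addition that assumes the Table 12.7 data and refits the least-squares line at full precision; the list named residuals plays the role of the calculator's RESID list, and the other names are hypothetical.

```python
import math

# Table 12.7 data: year (x) and ethnic minority percentage (y).
years = [1970, 1973, 1976, 1979, 1982, 1983, 1986, 1989, 1992, 1995]
percent = [14.13, 12.27, 14.08, 18.16, 27.64, 28.72, 31.86, 33.14, 45.37, 53.1]

n = len(years)
x_bar = sum(years) / n
y_bar = sum(percent) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, percent))
     / sum((x - x_bar) ** 2 for x in years))
a = y_bar - b * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(years, percent)]

sse = sum(e ** 2 for e in residuals)       # sum of the squared errors
cutoff = 1.9 * math.sqrt(sse / (n - 2))    # n - 2 = 8 for these data

# A point is flagged when its residual exceeds the cutoff in absolute value.
outliers = [(x, y) for x, y, e in zip(years, percent, residuals)
            if abs(e) > cutoff]

print(f"SSE = {sse:.4f}, cutoff = {cutoff:.4f}")
print("outliers:", outliers)
```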

To obtain estimates of y for various x-values:

There are various ways to determine estimates for "y". One way is to substitute values for "x" in the
equation. Another way is to use TRACE on the graph of the regression line.

12.5.5 TI-83, 83+, 84 instructions for distributions and tests 
12.5.5.1 Distributions 

Access DISTR (for "Distributions"). 

For technical assistance, visit the Texas Instruments website at http://www.ti.com 6 and enter your calcu- 
lator model into the "search" box. 

Binomial Distribution 

• binompdf(n, p, x) corresponds to P(X = x)

• binomcdf(n, p, x) corresponds to P(X ≤ x)

• To see a list of all probabilities for x: 0, 1, . . . , n, leave off the "x" parameter.

Poisson Distribution

• poissonpdf(λ, x) corresponds to P(X = x)

• poissoncdf(λ, x) corresponds to P(X ≤ x)
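
For readers without the calculator, the four functions above can be imitated with the Python standard library. This is a sketch added for illustration: the function names simply mirror the calculator commands, lam stands for λ, and the inputs are hypothetical.

```python
import math


def binompdf(n, p, x):
    """P(X = x) for X ~ B(n, p)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomcdf(n, p, x):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(binompdf(n, p, k) for k in range(x + 1))


def poissonpdf(lam, x):
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def poissoncdf(lam, x):
    """P(X <= x) for a Poisson random variable with mean lam."""
    return sum(poissonpdf(lam, k) for k in range(x + 1))


# Hypothetical inputs chosen only for illustration.
print(binompdf(10, 0.5, 3))
print(binomcdf(10, 0.5, 3))
# All probabilities for x = 0, 1, ..., n, like leaving off the "x" parameter.
print([binompdf(4, 0.5, x) for x in range(5)])
print(poissonpdf(2.0, 0))
print(poissoncdf(2.0, 3))
```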



6 http://www.ti.com
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Continuous Distributions (general) 

• −∞ uses the value -1EE99 for left bound

• ∞ uses the value 1EE99 for right bound

Normal Distribution

• normalpdf(x, μ, σ) yields a probability density function value (only useful to plot the normal curve,
in which case "x" is the variable)

• normalcdf(left bound, right bound, μ, σ) corresponds to P(left bound < X < right bound)

• normalcdf(left bound, right bound) corresponds to P(left bound < Z < right bound) - standard
normal

• invNorm(p, μ, σ) yields the critical value, k: P(X < k) = p

• invNorm(p) yields the critical value, k: P(Z < k) = p for the standard normal
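
The normal-distribution commands have close counterparts in Python's statistics.NormalDist class. The sketch below is an illustrative addition; the wrapper names normalcdf and inv_norm are assumptions chosen to echo the calculator, and ±1e99 is used in place of ±infinity in the same spirit as 1EE99.

```python
from statistics import NormalDist

LEFT_INF = -1e99    # stands in for minus infinity, like -1EE99
RIGHT_INF = 1e99    # stands in for plus infinity, like 1EE99


def normalcdf(left, right, mu=0.0, sigma=1.0):
    """P(left < X < right) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(right) - dist.cdf(left)


def inv_norm(p, mu=0.0, sigma=1.0):
    """Critical value k with P(X < k) = p."""
    return NormalDist(mu, sigma).inv_cdf(p)


print(normalcdf(LEFT_INF, 1.96))   # area to the left of z = 1.96
print(normalcdf(-1, 1))            # area within one standard deviation
print(inv_norm(0.975))             # z with 97.5% of the area to its left
print(NormalDist(0, 1).pdf(0))     # height of the standard normal curve at 0
```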

Student-t Distribution 

• tpdf (x,df) yields the probability density function value (only useful to plot the student-t curve, in 
which case "x" is the variable) 

• tcdf(left bound, right bound, df) corresponds to P(left bound < t < right bound) 

Chi-square Distribution 

• χ²pdf(x, df) yields the probability density function value (only useful to plot the χ² curve, in which
case "x" is the variable)

• χ²cdf(left bound, right bound, df) corresponds to P(left bound < χ² < right bound)

F Distribution 

• Fpdf (x , df num, df denom) yields the probability density function value (only useful to plot the F curve, 
in which case "x" is the variable) 

• Fcdf(left bound, right bound, df num, df denom) corresponds to P(left bound < F < right bound) 

12.5.5.2 Tests and Confidence Intervals 

Access STAT and TESTS. 

For the Confidence Intervals and Hypothesis Tests, you may enter the data into the appropriate lists and 
press DATA to have the calculator find the sample means and standard deviations. Or, you may enter the 
sample means and sample standard deviations directly by pressing STAT once in the appropriate tests. 

Confidence Intervals 

• ZInterval is the confidence interval for mean when σ is known

• TInterval is the confidence interval for mean when σ is unknown; s estimates σ.

• 1-PropZInt is the confidence interval for proportion 

NOTE: The confidence levels should be given as percents (ex. enter "95" for a 95% confidence 
level). 
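
To show what ZInterval and 1-PropZInt compute, here is a minimal Python sketch. It is an illustrative addition that assumes the usual textbook formulas (point estimate plus or minus the error bound, EBM or EBP); the function names and the sample inputs are hypothetical.

```python
import math
from statistics import NormalDist


def critical_z(confidence_percent):
    """z value with (1 - CL)/2 area in the right tail; CL given as a percent."""
    cl = confidence_percent / 100
    return NormalDist().inv_cdf(1 - (1 - cl) / 2)


def z_interval(x_bar, sigma, n, confidence_percent):
    """Confidence interval for a mean when sigma is known."""
    ebm = critical_z(confidence_percent) * sigma / math.sqrt(n)
    return x_bar - ebm, x_bar + ebm


def one_prop_z_int(x, n, confidence_percent):
    """Confidence interval for a proportion, with x successes in n trials."""
    p_prime = x / n
    q_prime = 1 - p_prime
    ebp = critical_z(confidence_percent) * math.sqrt(p_prime * q_prime / n)
    return p_prime - ebp, p_prime + ebp


# Hypothetical inputs: enter 95 for a 95% confidence level.
print(z_interval(x_bar=50, sigma=10, n=100, confidence_percent=95))
print(one_prop_z_int(x=50, n=100, confidence_percent=95))
```

TInterval would need the Student-t critical value instead of z, which the Python standard library does not provide, so it is left out of this sketch.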

Hypothesis Tests 

• Z-Test is the hypothesis test for single mean when σ is known

• T-Test is the hypothesis test for single mean when σ is unknown; s estimates σ.

• 2-SampZTest is the hypothesis test for 2 independent means when both σ's are known

• 2-SampTTest is the hypothesis test for 2 independent means when both σ's are unknown

• 1-PropZTest is the hypothesis test for single proportion.

• 2-PropZTest is the hypothesis test for 2 proportions.

• χ²-Test is the hypothesis test for independence.

• χ²GOF-Test is the hypothesis test for goodness-of-fit (TI-84+ only).

• LinRegTTEST is the hypothesis test for Linear Regression (TI-84+ only).



NOTE: Input the null hypothesis value in the row below "Inpt." For a test of a single mean, "μ0"
represents the null hypothesis. For a test of a single proportion, "p0" represents the null hypothesis.
Enter the alternate hypothesis on the bottom row.
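
As an illustration of what Z-Test reports, the following Python sketch computes the test statistic and p-value for a single mean when σ is known. It is an addition to the original text under stated assumptions: the function name z_test and the "alternative" argument (standing in for the alternate-hypothesis row on the calculator screen) are hypothetical, and the inputs are made up.

```python
import math
from statistics import NormalDist


def z_test(mu0, sigma, x_bar, n, alternative="not equal"):
    """Z-Test for a single mean when sigma is known.

    Returns (z, p_value). 'alternative' is "not equal", "less", or "greater".
    """
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value


# Hypothetical example: null hypothesis mu0 = 100, sigma = 15,
# sample mean 106 from n = 25 observations, right-tailed alternative.
z, p_value = z_test(mu0=100, sigma=15, x_bar=106, n=25, alternative="greater")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```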
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12.6 Links to Probability Tables 7 



NOTE: When you are finished with the table link, use the back button on your browser to return 
here. 

Tables (NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, 
January 3, 2009) 

• Student-t table 8 

• Normal table 9 

• Chi-Square table 10 

• F-table 11 

• All four tables can be accessed by going to http://www.itl.nist.gov/div898/handbook/eda/section3/eda367.htm 12 

95% Critical Values of the Sample Correlation Coefficient Table 

• 95% Critical Values of the Sample Correlation Coefficient 13 

NOTE: The url for this table is http://cnx.org/content/m17098/latest/



7 This content is available online at <http://cnx.org/content/m19138/1.3/>.
8 http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm
9 http://www.itl.nist.gov/div898/handbook/eda/section3/eda3671.htm
10 http://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm
11 http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm
12 http://www.itl.nist.gov/div898/handbook/eda/section3/eda367.htm
13 http://cnx.org/content/m17098/latest/
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A Average 

A number that describes the central tendency of the data. There are a number of specialized 
averages, including the arithmetic mean, weighted mean, median, mode, and geometric mean. 

B Bernoulli Trials 

An experiment with the following characteristics: 

• There are only 2 possible outcomes called "success" and "failure" for each trial. 

• The probability p of a success is the same for any trial (so the probability q = 1 - p of a
failure is the same for any trial).

Binomial Distribution 

A discrete random variable (RV) which arises from Bernoulli trials. There are a fixed number, n, 
of independent trials. "Independent" means that the result of any trial (for example, trial 1) 
does not affect the results of the following trials, and all trials are conducted under the same 
conditions. Under these circumstances the binomial RV X is defined as the number of successes 
in n trials. The notation is: $X \sim B(n, p)$. The mean is $\mu = np$ and the standard deviation is
$\sigma = \sqrt{npq}$. The probability of exactly x successes in n trials is
$P(X = x) = \binom{n}{x} p^x q^{n-x}$.

Binomial Distribution 

A discrete random variable (RV) which arises from Bernoulli trials with the following additional
requirements. There is a fixed number, n, of independent trials. "Independent" means that the
result of any trial (for example, trial 1) in no way affects the results of the following trials,
and all trials are conducted under the same conditions. Under these circumstances the binomial
RV X is defined as the number of successes in n trials. The notation is: $X \sim B(n, p)$; the
domain is x = 0, 1, 2, ..., n; the mean is $\mu = np$, and the variance is $\sigma^2 = npq$. The
probability of exactly x successes in n trials is $P(X = x) = \binom{n}{x} p^x q^{n-x}$.
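
As a rough numerical check of the binomial formulas, the following Python sketch evaluates the
probabilities with the standard library and recovers the mean and standard deviation by direct
summation. The function name `binomial_pmf` and the values n = 5, p = 0.7 are assumptions chosen
only for illustration.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)

n, p = 5, 0.7  # illustrative values only
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean and standard deviation computed directly from the distribution
mean = sum(x * pr for x, pr in enumerate(probs))
sd = sqrt(sum((x - mean) ** 2 * pr for x, pr in enumerate(probs)))

print(round(mean, 4), round(n * p, 4))                # both agree with np
print(round(sd, 4), round(sqrt(n * p * (1 - p)), 4))  # both agree with sqrt(npq)
```

Summing over all x from 0 to n is practical here because n is small; for large n the closed forms
np and sqrt(npq) are the sensible choice.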

C Central Limit Theorem 

Given a random variable (RV) with known mean $\mu$ and known standard deviation $\sigma$, we are
sampling with size n and we are interested in two new RVs: the sample mean, $\bar{X}$, and the
sample sum, $\Sigma X$. If the size n of the sample is sufficiently large, then
$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ and
$\Sigma X \sim N\left(n\mu, \sqrt{n}\,\sigma\right)$. In words, if the size n of the sample is
sufficiently large, then the distribution of the sample means and the distribution of the sample
sums will approximate a normal distribution regardless of the shape of the population. The mean
of the sample means will equal the population mean and the mean of the sample sums will equal n
times the population mean. The standard deviation of the distribution of the sample means,
$\frac{\sigma}{\sqrt{n}}$, is called the standard error of the mean.

Central Limit Theorem 

Given a random variable (RV) with known mean $\mu$ and known variance $\sigma^2$, we are sampling
with size n and we are interested in two new RVs: the sample mean, $\bar{X}$, and the sample sum,
$\Sigma X$. If the size n of the sample is sufficiently large, then
$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ and
$\Sigma X \sim N\left(n\mu, n\sigma^2\right)$. In words, if the size n of the sample is
sufficiently large, then the distribution of the sample means and the distribution of the sample
sums will approximate a normal distribution regardless of the shape of the population. Moreover,
the mean of the sample means will equal the population mean and the mean of the sample sums will
equal n times the population mean. The standard deviation of the distribution of the sample
means, $\frac{\sigma}{\sqrt{n}}$, is called the standard error of the mean.
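
The Central Limit Theorem can be explored with a small simulation. The sketch below is only an
illustration under assumed settings: the population is uniform on [0, 1), and the sample size
n = 40, the number of repeated samples, and the random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 40              # size of each sample (illustrative)
num_samples = 5000  # how many samples are drawn (illustrative)

# Population: uniform on [0, 1), which has mu = 0.5 and sigma = sqrt(1/12)
mu, sigma = 0.5, sqrt(1 / 12)

# One sample mean per repeated sample
sample_means = [sum(random.random() for _ in range(n)) / n
                for _ in range(num_samples)]

print(mean(sample_means), mu)                     # mean of sample means vs. mu
print(pstdev(sample_means), sigma / sqrt(n))      # spread vs. standard error
```

Even though the population is flat rather than bell-shaped, the sample means cluster around the
population mean with a spread close to the standard error of the mean.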

Coefficient of Correlation 

A measure developed by Karl Pearson (early 1900s) that gives the strength of association
between the independent variable and the dependent variable. The formula is:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where n is the number of data points. The coefficient cannot be more than 1 or less than -1.
The closer the coefficient is to ±1, the stronger the evidence of a significant linear relationship
between x and y.
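
The correlation formula translates directly into a few sums. The following Python sketch is a
minimal illustration; the function name `correlation` and the five data points are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient computed from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))  # about 0.7746, a fairly strong positive linear association
```

The function divides by zero if all x values (or all y values) are identical, since the
coefficient is undefined in that case.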

Conditional Probability 

The likelihood that an event will occur given that another event has already occurred. 

Confidence Interval (CI) 

An interval estimate for an unknown population parameter. This depends on: 

• The desired confidence level. 

• Information that is known about the distribution (for example, known standard deviation). 

• The sample and its size. 

Confidence Interval 

An interval estimate for an unknown population parameter. This depends on:

• The desired confidence level.

• What is known about the distribution (for example, known variance).

• The information gathered from the sample.

Confidence Level 

The percent expression for the probability that the confidence interval contains the true
population parameter. For example, if CL = 90%, then in about 90 out of 100 samples the interval
estimate will enclose the true population parameter.

Contingency Table 

The method of displaying a frequency distribution as a table when the variables may be dependent
(contingent) on one another; the table provides an easy way to calculate conditional probabilities.

Continuous Random Variable 

A random variable (RV) whose outcomes are measured. 

Example: The height of trees in the forest is a continuous RV. 

Cumulative Relative Frequency 

The term applies to an ordered set of observations from smallest to largest. The Cumulative 
Relative Frequency is the sum of the relative frequencies for all values that are less than or equal 
to the given value. 
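
As an illustration of the definition, the sketch below builds relative and cumulative relative
frequencies for a small data set. The ten observations are hypothetical.

```python
from collections import Counter
from itertools import accumulate

data = [2, 2, 3, 5, 5, 5, 7, 8, 8, 10]  # hypothetical observations

counts = Counter(data)
values = sorted(counts)  # order the distinct values from smallest to largest
rel_freq = [counts[v] / len(data) for v in values]
cum_rel_freq = list(accumulate(rel_freq))  # running sum of relative frequencies

for v, rf, crf in zip(values, rel_freq, cum_rel_freq):
    print(v, round(rf, 2), round(crf, 2))
```

The last cumulative relative frequency is always 1 (up to rounding), because every observation is
less than or equal to the largest value.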



D Data 

A set of observations (a set of possible outcomes). Most data can be put into two groups:
qualitative (hair color, ethnic groups and other attributes of the population) and quantitative 
(distance traveled to college, number of children in a family, etc.). Quantitative data can be 
separated into two subgroups: discrete and continuous. Data is discrete if it is the result of 
counting (the number of students of a given ethnic group in a class, the number of books on a 
shelf, etc.). Data is continuous if it is the result of measuring (distance traveled, weight of 
luggage, etc.) 

Degrees of Freedom (df) 

The number of objects in a sample that are free to vary.

Discrete Random Variable

A random variable (RV) whose outcomes are counted. 

E Equally Likely 

Each outcome of an experiment has the same probability. 

Error Bound for a Population Mean (EBM) 

The margin of error. Depends on the confidence level, sample size, and known or estimated 
population standard deviation. 

Error Bound for a Population Proportion (EBP) 

The margin of error. Depends on the confidence level, sample size, and estimated population 
proportion. 

Event 

A subset in the set of all outcomes of an experiment. The set of all outcomes of an experiment is 
called a sample space and denoted, as a rule, by S. An event is any arbitrary subset in S: it can 
contain one outcome, two outcomes, and even no outcomes (empty subset) or all of them 
(sample space). Standard notations for events are capital letters such as A, B, C, etc. 

Expected Value 

Expected arithmetic average when an experiment is repeated many times (also called the
mean). Notations: $E(X)$, $\mu$. For a discrete random variable (RV) with probability distribution
function $P(x)$, the definition can also be written in the form $E(X) = \mu = \sum x P(x)$.
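
The sum in the definition of expected value is easy to evaluate for a small table. In the sketch
below, the probability distribution is hypothetical and `expected_value` is an illustrative helper
name.

```python
def expected_value(distribution):
    """Sum of x * P(x) over all values x of a discrete random variable."""
    return sum(x * p for x, p in distribution.items())

# Hypothetical PDF: value -> probability (the probabilities must sum to 1)
distribution = {0: 0.2, 1: 0.5, 2: 0.3}

mu = expected_value(distribution)
print(round(mu, 4))  # 0(0.2) + 1(0.5) + 2(0.3) = 1.1
```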

Experiment 

A planned activity carried out under controlled conditions. 

Exponential Distribution 

A continuous random variable (RV) that appears when we are interested in the intervals of time
between some random events, for example, the length of time between emergency arrivals at a
hospital. Notation: $X \sim Exp(m)$. The mean is $\mu = \frac{1}{m}$ and the standard deviation is
$\sigma = \frac{1}{m}$. The probability density function is $f(x) = m e^{-mx}$, $x \ge 0$, and the
cumulative distribution function is $P(X \le x) = 1 - e^{-mx}$.

Exponential Distribution 

A continuous random variable (RV) that appears when we are interested in intervals of time
between some random events, for example, the length of time between emergency arrivals at a
hospital. Notation: $X \sim Exp(m)$; the mean is $\mu = \frac{1}{m}$ and the variance is
$\sigma^2 = \frac{1}{m^2}$; the probability density function is $f(x) = m e^{-mx}$, $x \ge 0$, and
the cumulative distribution function is $P(X \le x) = 1 - e^{-mx}$.
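
The exponential pdf and cumulative distribution function can be evaluated directly. The sketch
below is illustrative only; the decay rate m = 0.25 (a mean of 4 time units between events) is an
assumed value, and the function names are hypothetical.

```python
from math import exp

def exponential_pdf(x, m):
    """f(x) = m * e^(-m x) for x >= 0, and 0 otherwise."""
    return m * exp(-m * x) if x >= 0 else 0.0

def exponential_cdf(x, m):
    """P(X <= x) = 1 - e^(-m x) for x >= 0, and 0 otherwise."""
    return 1 - exp(-m * x) if x >= 0 else 0.0

m = 0.25         # assumed decay rate
mean = 1 / m     # mean and standard deviation are both 1/m

# Probability that the waiting time is at most the mean
print(round(exponential_cdf(mean, m), 4))  # 1 - e^(-1), about 0.6321
```

Note that more than half of the waiting times fall below the mean, which reflects the right skew
of the distribution.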

F Frequency 

The number of times a value of the data occurs. 

G Geometric Distribution 

A discrete random variable (RV) which arises from Bernoulli trials. The trials are repeated
until the first success. The geometric variable X is defined as the number of trials until the first
success. Notation: $X \sim G(p)$. The mean is $\mu = \frac{1}{p}$ and the standard deviation is
$\sigma = \sqrt{\frac{1}{p}\left(\frac{1}{p} - 1\right)}$. The probability that the first success
occurs on trial x (that is, after exactly x - 1 failures) is given by the
formula: $P(X = x) = p(1 - p)^{x-1}$.
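
A short numerical sketch of the geometric formulas follows. The success probability p = 0.2 is an
assumed value and `geometric_pmf` is an illustrative name.

```python
from math import sqrt

def geometric_pmf(x, p):
    """Probability that the first success occurs on trial x (x = 1, 2, 3, ...)."""
    return p * (1 - p) ** (x - 1)

p = 0.2  # assumed probability of success on each trial
mean = 1 / p
sd = sqrt((1 / p) * (1 / p - 1))

print(round(geometric_pmf(3, p), 4))  # two failures, then a success: 0.8 * 0.8 * 0.2
print(mean, round(sd, 4))
```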

H Hypergeometric Distribution 

A discrete random variable (RV) that is characterized by 

• A fixed number of trials. 

• The probability of success is not the same from trial to trial. 

We sample from two groups of items when we are interested in only one group. X is defined as
the number of successes out of the total number of items chosen. Notation: $X \sim H(r, b, n)$,
where r = the number of items in the group of interest, b = the number of items in the group not
of interest, and n = the number of items chosen.
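
The glossary entry does not state a probability formula, so the sketch below relies on the
standard counting formula for sampling without replacement,
$P(X = x) = \binom{r}{x}\binom{b}{n-x} / \binom{r+b}{n}$. The group sizes used are hypothetical.

```python
from math import comb

def hypergeometric_pmf(x, r, b, n):
    """P(X = x): x items from the group of interest (size r) among n items chosen."""
    return comb(r, x) * comb(b, n - x) / comb(r + b, n)

r, b, n = 6, 5, 4  # hypothetical: 6 items of interest, 5 others, 4 chosen
probs = [hypergeometric_pmf(x, r, b, n) for x in range(n + 1)]

print(round(hypergeometric_pmf(2, r, b, n), 4))  # 15 * 10 / 330
print(round(sum(probs), 4))                      # the probabilities sum to 1
```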

Hypothesis 

A statement about the value of a population parameter. In case of two hypotheses, the statement
assumed to be true is called the null hypothesis (notation $H_0$) and the contradictory statement is
called the alternate hypothesis (notation $H_a$).

Hypothesis Testing 

Based on sample evidence, a procedure to determine whether the hypothesis stated is a 
reasonable statement and cannot be rejected, or is unreasonable and should be rejected. 

I Independent Events 

The occurrence of one event has no effect on the probability of the occurrence of any other event.
Events A and B are independent if one of the following is true: (1) P(A|B) = P(A); (2)
P(B|A) = P(B); (3) P(A and B) = P(A) P(B).
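
The independence conditions can be checked by counting outcomes. The sketch below assumes one
roll of a fair die with equally likely outcomes; the events A, B, and C are hypothetical, and
exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}      # an even number is rolled
B = {1, 2, 3, 4}   # a number of at most 4 is rolled
C = {1, 2, 3}      # a number of at most 3 is rolled

# Condition (3): P(A and B) = P(A) P(B)
a_b_independent = prob(A & B) == prob(A) * prob(B)
a_c_independent = prob(A & C) == prob(A) * prob(C)

# Condition (1): P(A|B) = P(A), using P(A|B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

print(a_b_independent, a_c_independent, p_a_given_b)
```

Here A and B are independent, while A and C are not: knowing that the roll is at most 3 changes
the probability of an even number from 1/2 to 1/3.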

Inferential Statistics 

Also called statistical inference or inductive statistics. This facet of statistics deals with 
estimating a population parameter based on a sample statistic. For example, if 4 out of the 100 
calculators sampled are defective we might infer that 4 percent of the production is defective. 

Interquartile Range (IQR)

The distance between the third quartile (Q3) and the first quartile (Q1). IQR = Q3 - Q1.

L Level of Significance of the Test 

Probability of a Type I error (rejecting the null hypothesis when it is true). Notation: α. In
hypothesis testing, the Level of Significance is called the preconceived α or the preset α.

M Mean 

A number that measures the central tendency. A common name for mean is 'average.' The term
'mean' is a shortened form of 'arithmetic mean.' By definition, the mean for a sample (denoted
by $\bar{x}$) is
$\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$,
and the mean for a population (denoted by $\mu$) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.

Median 

A number that separates ordered data into halves. Half the values are the same number or 
smaller than the median and half the values are the same number or larger than the median. 
The median may or may not be part of the data. 

Mode 

The value that appears most frequently in a set of data. 

Mutually Exclusive 

An observation cannot fall into more than one class (category). Being in one category prevents
being in a mutually exclusive category.

N Normal Distribution 

A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, where $\mu$ is the mean of
the distribution and $\sigma$ is the standard deviation. Notation: $X \sim N(\mu, \sigma)$. If
$\mu = 0$ and $\sigma = 1$, the RV is called the standard normal distribution.

Normal Distribution 

A continuous random variable (RV) with pdf
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$, where $\mu$ is the mean of the
distribution and $\sigma$ is its standard deviation. Notation: $X \sim N(\mu, \sigma^2)$. If
$\mu = 0$ and $\sigma = 1$, the RV is called the standard normal distribution, and its values are
called z-scores.
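
The normal pdf can be coded straight from the formula and compared with Python's built-in
`statistics.NormalDist`. This is a minimal sketch; `normal_pdf` is an illustrative name and the
parameter values are arbitrary.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Peak of the standard normal curve (mu = 0, sigma = 1)
print(round(normal_pdf(0, 0, 1), 5))

# The hand-coded formula and the library agree for other parameters as well
print(normal_pdf(7, 5, 2), NormalDist(mu=5, sigma=2).pdf(7))
```

`NormalDist` takes the standard deviation, not the variance, as its second argument, which matches
the $N(\mu, \sigma)$ notation rather than $N(\mu, \sigma^2)$.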

O Outcome (observation) 

A particular result of an experiment.

Outlier

An observation that does not fit the rest of the data. 

P p-value 

The probability, assuming the null hypothesis is true, that chance alone would produce a result
at least as extreme as the one observed in the sample. The smaller the p-value, the stronger the
evidence is against the null hypothesis.

Parameter

A numerical characteristic of the population.

Percentile

A number that divides ordered data into hundredths. 

Example: Let a data set contain 200 ordered observations starting with {2.3, 2.7, 2.8, 2.9, 2.9, 3.0...}.
Then the first percentile is $\frac{2.7 + 2.8}{2} = 2.75$, because 1% of the data is to the left of
this point on the number line and 99% of the data is on its right. The second percentile is
$\frac{2.9 + 2.9}{2} = 2.9$. Percentiles may or may not be part of the data. In this example, the
first percentile is not in the data, but the second percentile is. The median of the data is the
second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th
percentiles, respectively.

Point Estimate 

A single number computed from a sample and used to estimate a population parameter. 

Poisson Distribution 

A discrete random variable (RV) that counts the number of times a certain event will occur in a 
specific interval. Characteristics of the variable: 

• The probability that the event occurs in a given interval is the same for all intervals. 

• The events occur with a known mean and independently of the time since the last event. 

The distribution is defined by the mean $\mu$ of the event in the interval. Notation:
$X \sim P(\mu)$. The standard deviation is $\sigma = \sqrt{\mu}$. The probability of exactly x
occurrences in the interval is $P(X = x) = \frac{e^{-\mu}\mu^x}{x!}$. The Poisson distribution is
often used to approximate the binomial distribution when n is "large" and p is "small" (a general
rule is that n should be greater than or equal to 20 and p should be less than or equal to 0.05);
in that case the mean is $\mu = np$.
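
The quality of the Poisson approximation to the binomial can be inspected numerically. The sketch
below is illustrative; n = 100 and p = 0.02 are assumed values that satisfy the rule of thumb, and
the function names are hypothetical.

```python
from math import comb, exp, factorial

def poisson_pmf(x, mu):
    """P(X = x) = e^(-mu) * mu^x / x! for a Poisson RV with mean mu."""
    return exp(-mu) * mu**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 100, 0.02  # assumed: n >= 20 and p <= 0.05
mu = n * p        # mean used for the approximating Poisson

# Compare the exact binomial probabilities with the Poisson approximation
for x in range(5):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, mu), 4))
```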

Population 

The collection, or set, of all individuals, objects, or measurements whose properties are being 
studied. 

Probability 

A number between 0 and 1, inclusive, that gives the likelihood that a specific event will occur.
More precisely, the foundation of statistics is given by the following 3 axioms (by A. N.
Kolmogorov, 1930s): Let S denote the sample space and let A and B be any two events in S. Then:

• $0 \le P(A) \le 1$.

• If A and B are any two mutually exclusive events, then P(A or B) = P(A) + P(B).

• P(S) = 1.

Probability Distribution Function (PDF) 

A mathematical description of a discrete random variable (RV), given either in the form of an 
equation (formula) , or in the form of a table listing all the possible outcomes of an experiment 
and the probability associated with each outcome. 

Example: A biased coin with probability 0.7 for a head (in one toss of the coin) is tossed 5 times.
We are interested in the number of heads (the RV X = the number of heads). X is Binomial, so
$X \sim B(5, 0.7)$ and $P(X = x) = \binom{5}{x} (0.7)^x (0.3)^{5-x}$, or in the form of the table:

x     P(X = x)
0     0.0024
1     0.0284
2     0.1323
3     0.3087
4     0.3602
5     0.1681

Table 4.3



Proportion 



• As a number: A proportion is the number of successes divided by the total number in the 
sample. 

• As a probability distribution: Given a binomial random variable (RV), $X \sim B(n, p)$, consider
the ratio of the number X of successes in n Bernoulli trials to the number n of trials:
$P' = \frac{X}{n}$. This new RV is called a proportion, and if the number of trials, n, is large
enough, $P' \sim N\left(p, \sqrt{\frac{pq}{n}}\right)$.

Q Qualitative Data 

See Data. 

Quantitative Data

See Data.

Quartiles

The numbers that separate the data into quarters. Quartiles may or may not be part of the data. 
The second quartile is the median of the data. 

R Random Variable (RV) 

See Variable.

Relative Frequency 

The ratio of the number of times a value of the data occurs in the set of all outcomes to the 
number of all outcomes. 



S Sample

A portion of the population under study. A sample is representative if it characterizes the
population being studied.

Sample Space 

The set of all possible outcomes of an experiment. 

Standard Deviation 

A number that is equal to the square root of the variance and measures how far data values are
from their mean. Notation: s for sample standard deviation and σ for population standard
deviation.

Standard Error of the Mean 

The standard deviation of the distribution of the sample means, $\frac{\sigma}{\sqrt{n}}$.

Standard Normal Distribution 

A continuous random variable (RV) $X \sim N(0, 1)$. When X follows the standard normal
distribution, it is often noted as $Z \sim N(0, 1)$.

Statistic 

A numerical characteristic of the sample. A statistic estimates the corresponding population 
parameter. For example, the average number of full-time students in a 7:30 a.m. class for this 
term (statistic) is an estimate for the average number of full-time students in any class this term 
(parameter). 

Student's-t Distribution 

Investigated and reported by William S. Gosset in 1908 and published under the pseudonym
Student. The major characteristics of the random variable (RV) are:

• It is continuous and can assume any real value.

• The pdf is symmetrical about its mean of zero. However, it is more spread out and flatter at
the apex than the normal distribution.

• It approaches the standard normal distribution as n gets larger.

• There is a "family" of t distributions: each member of the family is completely
defined by the number of degrees of freedom, which is one less than the number of data values.

T Tree Diagram 

A useful visual representation of a sample space and events in the form of a "tree" with
branches marked by possible outcomes together with their associated probabilities
(frequencies, relative frequencies).

Type 1 Error 

The decision is to reject the null hypothesis when, in fact, the null hypothesis is true.

Type 2 Error

The decision is to not reject the null hypothesis when, in fact, the null hypothesis is false.

U Uniform Distribution 

A continuous random variable (RV) that has equally likely outcomes over the domain a < x < b.
Often referred to as the rectangular distribution because the graph of its pdf has the form of a
rectangle. Notation: $X \sim U(a, b)$. The mean is $\mu = \frac{a+b}{2}$ and the variance is
$\sigma^2 = \frac{(b-a)^2}{12}$; the probability density function is $f(x) = \frac{1}{b-a}$ for
$a \le x \le b$, and the cumulative distribution is $P(X \le x) = \frac{x-a}{b-a}$.
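
The uniform formulas are simple enough to verify by hand, and the sketch below does the same
arithmetic in Python. The endpoints a = 0 and b = 20 are assumed values and the function names are
illustrative.

```python
def uniform_mean(a, b):
    """Mean of X ~ U(a, b)."""
    return (a + b) / 2

def uniform_variance(a, b):
    """Variance of X ~ U(a, b)."""
    return (b - a) ** 2 / 12

def uniform_cdf(x, a, b):
    """P(X <= x) for X ~ U(a, b); 0 below a and 1 above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0, 20  # assumed endpoints
print(uniform_mean(a, b))                # midpoint of the interval
print(round(uniform_variance(a, b), 4))  # 400 / 12
print(uniform_cdf(5, a, b))              # one quarter of the interval lies below 5
```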

V Variable (Random Variable) 

A characteristic of interest in a population being studied. Common notation for variables are 
upper case Latin letters X, Y, Z,...; common notation for a specific value from the domain (set of 
all possible values of a variable) are lower case Latin letters x, y, z,.... For example, if X is the 
number of children in a family, then x represents a specific integer 0, 1, 2, 3, .... Variables in 
statistics differ from variables in intermediate algebra in the two following ways.

• The domain of the random variable (RV) is not necessarily a numerical set; the domain may
be expressed in words; for example, if X = hair color then the domain is {black, blond, gray,
green, orange}.

• We can tell what specific value x of the Random Variable X takes only after performing the 
experiment. 

Variance 

Mean of the squared deviations from the mean. Square of the standard deviation. 

Venn Diagram 

The visual representation of a sample space and events in the form of circles or ovals showing 
their intersections. 

Z z-score 

The linear transformation of the form $z = \frac{x - \mu}{\sigma}$. If this transformation is
applied to any normal distribution $X \sim N(\mu, \sigma)$, the result is the standard normal
distribution $Z \sim N(0, 1)$. If this transformation is applied to any specific value x of the RV
with mean $\mu$ and standard deviation $\sigma$, the result is called the z-score of x. Z-scores
allow us to compare data that are normally distributed but scaled differently.
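
To illustrate how z-scores put differently scaled data on a common footing, the sketch below
compares two hypothetical exam scores. The means, standard deviations, and scores are invented
for the example, and both score distributions are assumed to be normal.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

# Hypothetical exams: A has mean 75 and sd 5; B has mean 30 and sd 10
z_a = z_score(85, 75, 5)
z_b = z_score(40, 30, 10)
print(z_a, z_b)  # the score on exam A is further above its mean

# Under the normality assumption, P(Z < z) gives the fraction of scores below
print(round(NormalDist().cdf(z_a), 4))
print(round(NormalDist().cdf(z_b), 4))
```

Both raw scores are 10 points above their means, yet the z-scores show that the exam A result is
the more unusual of the two.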

Index of Keywords and Terms

Keywords are listed by the section with that keyword (page numbers are in parentheses). Keywords
do not necessarily appear in the text of the page. They are merely associated with that section.
Ex. apples, § 1.1 (1)

Terms are referenced by the page they appear on. Ex. apples, 1

By: Roberta Bloom

URL: http://cnx.org/content/m18868/1.1/

Pages: 88-91

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Probability Topics: Independent & Mutually Exclusive Events

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16837/1.6/

Module: "Probability Topics: Two Basic Rules of Probability" 

Used here as: "Two Basic Rules of Probability" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16847/1.11/

Pages: 92-95

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Probability Topics: Contingency Tables (modified R. Bloom)"

Used here as: "Contingency Tables (modified R. Bloom)"

By: Roberta Bloom

URL: http://cnx.org/content/m18859/1.1/

Pages: 95-97

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Probability Topics: Contingency Tables

By: Barbara Illowsky, Ph.D., Susan Dean

URL: http://cnx.org/content/m16835/1.5/

Module: "Probability Topics: Venn Diagrams (optional)" 

Used here as: "Venn Diagrams (optional)" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16848/1.12/

Pages: 97-99

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Probability Topics: Tree Diagrams (optional)" 

Used here as: "Tree Diagrams " 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16846/1.10/

Pages: 99-102

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Probability Topics: Summary of Formulas" 

Used here as: "Summary of Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16843/1.5/

Page: 103

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Probability Topics: Practice 1: Contingency Tables (modified R. Bloom)" 

Used here as: "Practice 1: Contingency Tables (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18926/1.1/

Pages: 104-105

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Probability Topics: Practice

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16839/1.8/

Module: "Probability Topics: Practice II" 

Used here as: "Practice 2: Calculating Probabilities" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16840/1.12/

Page: 106

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Probability Topics Homework Link (Collaborative Statistics; R. Bloom custom version)"

Used here as: "Homework Link"

By: Roberta Bloom

URL: http://cnx.org/content/m19053/1.1/

Page: 107

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 3 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19032/1.1/

Page: 108

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Discrete Random Variables: Introduction" 

Used here as: "Discrete Random Variables" 

By: Susan Dean, Barbara Illowsky Ph.D. 

URL: http://cnx.org/content/m16825/1.14/

Pages: 111-112

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Probability Distribution Function (PDF) for a Discrete Random Variable"

Used here as: "Probability Distribution Function (PDF) for a Discrete Random Variable"

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16831/1.14/

Pages: 112-113

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Mean or Expected Value and Standard Deviation" 

Used here as: "Mean or Expected Value and Standard Deviation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16828/1.16/

Pages: 113-116

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Common Discrete Probability Distribution Functions" 

Used here as: "Common Discrete Probability Distribution Functions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16821/1.6/

Page: 116

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Binomial"

Used here as: "Binomial Distribution"

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16820/1.16/

Pages: 116-119

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Geometric (optional)" 

Used here as: "Geometric Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16822/1.16/

Pages: 119-121

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Hypergeometric (optional)" 

Used here as: "Hypergeometric Distribution (Optional)" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16824/1.15/

Pages: 122-124

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Poisson (optional)" 

Used here as: "Poisson Distribution (Optional)" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16829/1.16/

Pages: 124-126

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Summary of the Discrete Probability Functions" 

Used here as: "Summary of Functions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16833/1.10/

Pages: 127-128

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Practice 1: Discrete Distributions" 

Used here as: "Practice 1: Discrete Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16830/1.14/

Page: 129

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Practice 2: Binomial Distribution"

Used here as: "Practice 2: Binomial Distribution"

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17107/1.18/

Pages: 130-131

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Practice 3: Poisson Distribution" 

Used here as: "Practice 3: Poisson Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17109/1.15/

Page: 132

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Practice 4: Geometric Distribution" 

Used here as: "Practice 4: Geometric Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17108/1.17/

Pages: 133-134

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables: Practice 5: Hypergeometric Distribution" 

Used here as: "Practice 5: Hypergeometric Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17106/1.13/

Page: 135

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Discrete Random Variables Homework Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19045/1.1/

Page: 136

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 4 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19033/1.1/

Page: 137

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/
Module: "Continuous Random Variables: Introduction (modified R. Bloom)"

Used here as: "Introduction to Continuous Random Variables" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18866/1.1/

Page: 141

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Continuous Random Variables: Introduction 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16808/1.7/

Module: "Continuous Random Variables: Properties of Continuous Probability Distributions" 

Used here as: "Properties of Continuous Probability Distributions" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18860/1.1/

Pages: 141-143

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Continuous Random Variables: Continuous Probability Functions" 

Used here as: "Continuous Probability Functions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16805/1.9/

Pages: 143-145

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Continuous Random Variables: The Uniform Distribution (modified R. Bloom)" 

Used here as: "The Uniform Distribution (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18925/1.2/

Pages: 146-152

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Continuous Random Variables: The Uniform Distribution 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16819/1.11/

Module: "Continuous Random Variables: The Exponential Distribution" 

Used here as: "The Exponential Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16816/1.15/

Pages: 152-157

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Continuous Random Variables: Summary of The Uniform and Exponential Probability Distributions"

Used here as: "Summary of the Uniform and Exponential Probability Distributions" 
By: Susan Dean, Barbara Illowsky, Ph.D.
URL: http://cnx.org/content/m16813/1.10/
Page: 158

Copyright: Maxfield Foundation
License: http://creativecommons.org/licenses/by/2.0/

Module: "Continuous Random Variables: Practice 1 (modified R. Bloom)" 

Used here as: "Practice 1: Uniform Distribution (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18672/1.2/

Pages: 159-161

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Continuous Random Variables: Practice 1 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16812/1.9/

Module: "Continuous Random Variables: Practice 2" 

Used here as: "Practice 2: Exponential Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16811/1.11/

Pages: 162-163

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Continuous Distributions Homework Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19043/1.1/

Page: 164

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 5 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19034/1.1/

Page: 165

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Normal Distribution: Introduction" 

Used here as: "The Normal Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16979/1.12/

Pages: 169-170

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Normal Distribution: Standard Normal Distribution"

Used here as: "The Standard Normal Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16986/1.7/

Page: 170

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Normal Distribution: Z-scores" 

Used here as: "Z-scores" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16991/1.9/

Pages: 171-172

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Normal Distribution: Areas to the Left and Right of x" 

Used here as: "Areas to the Left and Right of x" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16976/1.5/

Page: 173

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Normal Distribution: Calculations of Probabilities" 

Used here as: "Calculations of Probabilities" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16977/1.12/

Pages: 173-176

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Normal Distribution: Summary of Formulas" 

Used here as: "Summary of Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16987/1.5/

Page: 177

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Normal Distribution: Practice" 

Used here as: "Practice: The Normal Distribution" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16983/1.10/

Pages: 178-179

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Normal Distribution Homework Link (Collaborative Statistics; R. Bloom custom version)"

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19050/1.1/

Page: 180

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 6 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19035/1.1/

Page: 181

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Central Limit Theorem: Introduction" 

Used here as: "The Central Limit Theorem" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16953/1.17/

Pages: 183-184

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Central Limit Theorem: Central Limit Theorem for Sample Means" 

Used here as: "The Central Limit Theorem for Sample Means (Averages)" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16947/1.23/

Pages: 184-186

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Central Limit Theorem: Central Limit Theorem for Sums" 

Used here as: "The Central Limit Theorem for Sums (Optional)" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16948/1.16/

Pages: 187-188

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Central Limit Theorem: Using the C.L.T (modified R. Bloom)" 

Used here as: "Using the Central Limit Theorem (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18864/1.4/

Pages: 188-192

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Central Limit Theorem: Using the Central Limit Theorem 

By: Barbara Illowsky, Ph.D., Susan Dean 

URL: http://cnx.org/content/m16958/1.6/
Module: "Central Limit Theorem: Summary of Formulas"

Used here as: "Summary of Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16956/1.8/

Page: 193

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Central Limit Theorem: Practice (modified R. Bloom)" 

Used here as: "Practice: Central Limit Theorem (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18671/1.2/

Pages: 194-195

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Based on: Central Limit Theorem: Practice 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16954/1.9/

Module: "Central Limit Theorem Homework Link" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19031/1.1/

Page: 196

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 7 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19036/1.1/

Page: 197

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Confidence Intervals: Introduction" 

Used here as: "Confidence Intervals" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16967/1.16/

Pages: 199-201

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Confidence Intervals for a Population Mean, Population Standard Deviation Known, Normal (modified R. Bloom)"
By: Roberta Bloom 

URL: http://cnx.org/content/m18937/1.2/
Pages: 201-207
Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Confidence Intervals: Confidence Interval, Single Population Mean, Population Standard Deviation Known, Normal
By: Barbara Illowsky, Ph.D., Susan Dean
URL: http://cnx.org/content/m16962/1.6/

Module: "Confidence Interval for a Population Mean, Standard Deviation Unknown, Student-T (modifed R. Bloom)"
By: Roberta Bloom 

URL: http://cnx.org/content/m18935/1.2/
Pages: 207-210
Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Confidence Intervals: Confidence Interval, Single Population Mean, Standard Deviation Unknown, Student-T

By: Barbara Illowsky, Ph.D., Susan Dean
URL: http://cnx.org/content/m16959/1.6/

Module: "Confidence Interval for a Population Proportion (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m18934/1.2/

Pages: 210-213

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Confidence Intervals: Confidence Interval for a Population Proportion 

By: Barbara Illowsky, Ph.D., Susan Dean 

URL: http://cnx.org/content/m16963/1.5/

Module: "Confidence Intervals: Summary of Formulas" 

Used here as: "Summary of Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16973/1.8/

Page: 214

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Confidence Intervals: Practice 1" 

Used here as: "Practice 1: Confidence Intervals for Averages, Known Population Standard Deviation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16970/1.13/

Pages: 215-216

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Confidence Intervals: Practice 2"

Used here as: "Practice 2: Confidence Intervals for Averages, Unknown Population Standard Deviation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16971/1.14/

Pages: 217-218

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Confidence Intervals: Practice 3" 

Used here as: "Practice 3: Confidence Intervals for Proportions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16968/1.13/

Pages: 219-220

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Confidence Interval Homework Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19042/1.1/

Page: 221

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 8 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19037/1.1/

Page: 222

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Introduction" 

Used here as: "Hypothesis Testing: Single Mean and Single Proportion" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16997/1.11/

Pages: 225-226

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Null and Alternate Hypotheses" 

Used here as: "Null and Alternate Hypotheses" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16998/1.14/

Pages: 226-227

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Hypothesis Testing of Single Mean and Single Proportion: Outcomes and the Type I and Type II Errors"

Used here as: "Outcomes and the Type I and Type II Errors" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17006/1.8/

Pages: 227-228

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Distribution Needed for Hypothesis Testing"

Used here as: "Distribution Needed for Hypothesis Testing" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17017/1.13/

Pages: 228-229

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Assumptions" 

Used here as: "Assumption" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17002/1.16/

Page: 229

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Rare Events" 

Used here as: "Rare Events" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16994/1.8/

Page: 229

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Using the Sample to Test the Null Hypothesis"

Used here as: "Using the Sample to Support One of the Hypotheses" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16995/1.17/

Pages: 230-231

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Decision and Conclusion" 

Used here as: "Decision and Conclusion" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16992/1.11/

Page: 231

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/



Module: "Hypothesis Testing of Single Mean and Single Proportion: Additional Information" 

Used here as: "Additional Information" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m16999/1.13/

Pages: 231-232

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Summary of the Hypothesis Test" 

Used here as: "Summary of the Hypothesis Test" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16993/1.6/

Page: 233

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Examples" 

Used here as: "Examples" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17005/1.25/

Pages: 233-243

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Summary of Formulas" 

Used here as: "Summary of Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16996/1.9/

Page: 244

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Practice 1" 

Used here as: "Practice 1: Single Mean, Known Population Standard Deviation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17004/1.11/

Pages: 245-246

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing of Single Mean and Single Proportion: Practice 2" 

Used here as: "Practice 2: Single Mean, Unknown Population Standard Deviation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17016/1.12/

Pages: 247-248

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/
Module: "Hypothesis Testing of Single Mean and Single Proportion: Practice 3"

Used here as: "Practice 3: Single Proportion" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17003/1.15/

Pages: 249-250

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Test Homework Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19047/1.1/

Page: 251

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Chapter 9 Review Questions Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19038/1.1/

Page: 252

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Introduction" 

Used here as: "Hypothesis Testing: Two Population Means and Two Population Proportions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17029/1.9/

Pages: 255-256

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Comparing Two Independent Population Means with Unknown Population Standard Deviations"

Used here as: "Comparing Two Independent Population Means with Unknown Population Standard Deviations"

By: Susan Dean, Barbara Illowsky, Ph.D. 
URL: http://cnx.org/content/m17025/1.18/
Pages: 256-259

Copyright: Maxfield Foundation
License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Comparing Two Independent Population Means with Known Population Standard Deviations"

Used here as: "Comparing Two Independent Population Means with Known Population Standard Deviations"

By: Susan Dean, Barbara Illowsky, Ph.D. 
URL: http://cnx.org/content/m17042/1.10/
Pages: 259-261

Copyright: Maxfield Foundation
License: http://creativecommons.org/licenses/by/3.0/
Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Comparing Two Independent Population Proportions"

Used here as: "Comparing Two Independent Population Proportions" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17043/1.12/

Pages: 261-263

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Matched or Paired Samples"

Used here as: "Matched or Paired Samples" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17033/1.15/

Pages: 263-267

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Summary of Types of Hypothesis Tests"

Used here as: "Summary of Types of Hypothesis Tests" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17044/1.5/

Page: 268

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Practice 1" 

Used here as: "Practice 1: Hypothesis Testing for Two Proportions" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17027/1.13/

Pages: 269-270

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Testing: Two Population Means and Two Population Proportions: Practice 2" 

Used here as: "Practice 2: Hypothesis Testing for Two Averages" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17039/1.12/

Pages: 271-272

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Hypothesis Test (2 means; paired samples; 2 proportions) Homework Link (Collaborative Statistics; R. Bloom custom version)"
Used here as: "Homework Link" 
By: Roberta Bloom 

URL: http://cnx.org/content/m19046/1.1/
Page: 273

Copyright: Roberta Bloom
License: http://creativecommons.org/licenses/by/2.0/
Module: "Chapter 10 Review Questions Link (Collaborative Statistics; R. Bloom custom version)"

Used here as: "Review Questions Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19039/1.1/

Page: 274

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: Introduction" 

Used here as: "Linear Regression and Correlation" 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17089/1.5/

Page: 277

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: Linear Equations" 

Used here as: "Linear Equations" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17086/1.4/

Pages: 277-279

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: Slope and Y-Intercept of a Linear Equation" 

Used here as: "Slope and Y-Intercept of a Linear Equation" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17083/1.5/

Page: 279

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: Scatter Plots" 

Used here as: "Scatter Plots" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17082/1.6/

Pages: 280-281

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: The Regression Equation (modified R. Bloom)" 

Used here as: "The Regression Equation (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m33267/1.2/

Pages: 282-287

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Linear Regression and Correlation: The Regression Equation 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17090/1.8/
Module: "Linear Regression and Correlation: The Correlation Coefficient and Coefficient of Determination (modified R. Bloom)"

Used here as: "Correlation Coefficient and Coefficient of Determination" 

By: Roberta Bloom 

URL: http://cnx.org/content/m33269/1.2/

Pages: 287-288

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Linear Regression and Correlation: The Correlation Coefficient 

By: Susan Dean, Barbara Illowsky, Ph.D.

URL: http://cnx.org/content/m17092/1.6/

Module: "Linear Regression and Correlation: Testing the Significance of the Correlation Coefficient (modified R. Bloom)"

Used here as: "Facts About the Correlation Coefficient for Linear Regression (modified R. Bloom)" 
By: Roberta Bloom 

URL: http://cnx.org/content/m33270/1.2/
Pages: 289-293
Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Linear Regression and Correlation: Facts About the Correlation Coefficient for Linear Regression 
By: Susan Dean, Barbara Illowsky, Ph.D. 
URL: http://cnx.org/content/m17077/1.7/

Module: "Linear Regression and Correlation: Prediction (modified R. Bloom)" 

Used here as: "Prediction (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m33268/1.1/

Page: 294

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Linear Regression and Correlation: Prediction 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17095/1.6/

Module: "Linear Regression and Correlation: Outliers (modified R. Bloom)" 

Used here as: "Outliers (modified R. Bloom)" 

By: Roberta Bloom 

URL: http://cnx.org/content/m33271/1.1/

Pages: 294-299

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/3.0/

Based on: Linear Regression and Correlation: Outliers 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17094/1.7/

Module: "Linear Regression and Correlation: Summary" 

Used here as: "Summary" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17081/1.4/

Page: 300

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/
Module: "Linear Regression and Correlation: 95% Critical Values of the Sample Correlation Coefficient Table"

Used here as: "95% Critical Values of the Sample Correlation Coefficient Table" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17098/1.5/

Pages: 300-301

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression and Correlation: Practice" 

Used here as: "Practice: Linear Regression" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17088/1.8/

Pages: 302-304

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Linear Regression Homework Link (Collaborative Statistics; R. Bloom custom version)" 

Used here as: "Homework Link" 

By: Roberta Bloom 

URL: http://cnx.org/content/m19048/1.2/

Page: 305

Copyright: Roberta Bloom

License: http://creativecommons.org/licenses/by/2.0/

Module: "Collaborative Statistics: Data Sets" 

Used here as: "Data Sets" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m17132/1.5/

Pages: 307-309

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Collaborative Statistics: Symbols and their Meanings" 

Used here as: "Symbols and their Meanings" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16302/1.9/

Pages: 310-314

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/

Module: "Collaborative Statistics: English Phrases Written Mathematically" 

Used here as: "English Phrases Written Mathematically" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16307/1.5/

Page: 315

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/2.0/
Module: "Collaborative Statistics: Formulas"

Used here as: "Formulas" 

By: Susan Dean, Barbara Illowsky, Ph.D. 

URL: http://cnx.org/content/m16301/1.7/

Pages: 316-317

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Collaborative Statistics: Notes for the TI-83, 83+, 84 Calculator" 

Used here as: "Notes for the TI-83, 83+, 84 Calculator"

By: Barbara Illowsky, Ph.D., Susan Dean 

URL: http://cnx.org/content/m19710/1.6/

Pages: 318-327

Copyright: Maxfield Foundation

License: http://creativecommons.org/licenses/by/3.0/

Module: "Tables" 

Used here as: "Links to Probability Tables" 

By: Susan Dean 

URL: http://cnx.org/content/m19138/1.3/

Page: 328

Copyright: Susan Dean

License: http://creativecommons.org/licenses/by/2.0/



Collaborative Statistics: Custom Version modified by V. Moyle

This collection of Collaborative Statistics uses R. Bloom's revisions but excludes Chapter 11 (Chi-Square
Distribution) and Chapter 13 (F Distribution and ANOVA), giving a shortened version of the original
introductory statistics course. Collaborative Statistics was written by Barbara Illowsky and Susan Dean,
faculty members at De Anza College in Cupertino, California. The textbook was developed over several years
and has been used in regular and honors-level classroom settings and in distance learning classes. It is
intended for introductory statistics courses taken by students at two- and four-year colleges who are
majoring in fields other than math or engineering. Intermediate algebra is the only prerequisite.
The book focuses on applications of statistical knowledge rather than the theory behind it. This custom
textbook collection of revisions by R. Bloom has been modified by V. Moyle for her classes at Bellingham
Technical College; the homework content for the custom collection is now contained in a separate
homework collection.